


Exhibit M of Rockett Declaration

with Response dated 04/16/04 
In USSN: 09/828,423 



JOURNAL OF BIOLUMINESCENCE AND CHEMILUMINESCENCE



Bioluminescence and Chemiluminescence: 
Studies and Applications in 
Biology and Medicine 

Proceedings of the Vth International 
Symposium on Bioluminescence and 
Chemiluminescence 



Editors: 

M. Pazzagli, E. Cadenas, L. J. Kricka, 
A. Roda and P. E. Stanley 



Volume 4 1989 



WILEY

Chichester • New York • Brisbane • Toronto • Singapore



JBCHE7 4(1) 1-64( 
ISSN 0884-3996 



FROM BIOMEDICAL INFORMATION SERVICE






CLIN. CHEM. 37/11, 1955-1967 (1991) 



Multianalyte Microspot Immunoassay— Mi 

R. P. Ekins and F. W. Chu

Throughout the 1970s, controversy centered both on immunoassay
"sensitivity" per se and on the relative sensitivities of labeled
antibody (Ab) and labeled analyte methods. Our theoretical studies
revealed that RIA sensitivities could be surpassed only by the use of
very high-specific-activity nonisotopic labels in "noncompetitive"
designs, preferably with monoclonal antibodies. The time-resolved
fluorescence methodology known as DELFIA, developed in collaboration
with LKB/Wallac, represented the first commercial "ultrasensitive"
nonisotopic technique based on these theoretical insights, the same
concepts being subsequently adopted in comparable methodologies relying
on the use of chemiluminescent and enzyme labels. However,
high-specific-activity labels also permit the development of
"multianalyte" immunoassay systems combining ultrasensitivity with the
simultaneous measurement of tens, hundreds, or thousands of analytes in
a small biological sample. This possibility relies on simple, albeit
hitherto-unexploited, physicochemical concepts. The first is that all
immunoassays rely on the measurement of Ab occupancy by analyte. The
second is that, provided the Ab concentration used is "vanishingly
small," fractional Ab occupancy is independent of both Ab concentration
and sample volume. This leads to the notion of "ratiometric"
immunoassay, involving measurement of the ratio of signals (e.g.,
fluorescent signals) emitted by two labeled Abs, the first (a "sensor"
Ab) deposited as a microspot on a solid support, the second (a
"developing" Ab) directed against either occupied or unoccupied binding
sites of the sensor Ab. Our preliminary studies of this approach have
relied on a dual-channel scanning-laser confocal microscope, permitting
microspots of area 100 µm² or less to be analyzed, and implying that an
array of 10⁶ Ab-containing microspots, each directed against a different
analyte, could, in principle, be accommodated on an area of 1 cm².
Although measurement of such analyte numbers is unlikely ever to be
required, the ability to analyze biological fluids for a wide spectrum
of analytes is likely to transform immunodiagnostics in the next decade.

Additional Keyphrases: ratiometric immunoassays • scanning-laser
confocal microscope • fluoroimmunoassay

Immunoassay and other protein-binding assay meth- 
ods based on the use of radioisotopic labels have played 
a major role in medicine during the past three decades. 



Department of Molecular Endocrinology, University College
and Middlesex School of Medicine, Mortimer St, London W1N
8AA, U.K.

Presented at the 23rd annual Oak Ridge Conference on Ad- 
vanced Analytical Concepts for the Clinical Laboratory, St Louis 
MO, April 1991. 

Received May 6, 1991; accepted August 20, 1991. 



"Compact Disk" of the Future 



Their utility and importance have derived primarily 
from the structural specificity of many reactions be- 
tween binding proteins and analytes and the detectabil- 
ity of isotopically labeled reagents, the latter endowing 
such techniques with "exquisite sensitivity." Recently, 
however, interest has increasingly focused on noniso- 
topic techniques based on identical analytical princi- 
ples, differing only in the nature of the marker used to 
label the reactant (e.g., antibody or antigen), whose 
distribution between reacted ("bound") and unreacted 
("free") fractions constitutes the assay "response." 

The basic aims underlying this interest can be 
broadly classed under four main headings: 

• avoidance of the environmental, legal, economic, and 
practical disadvantages of isotopic techniques (e.g., lim- 
ited shelf life of isotopically labeled reagents, problems 
of radioactive waste disposal, cost and complexity of 
radioisotope counting equipment), particularly those 
impeding the development of, for example, simple diagnostic
kits for home or doctor's office use;

• achievement of greater assay sensitivity; 

• "direct" measurement of analyte concentrations by 
use of transducer-based "immunosensors"; 

• simultaneous measurement of multiple analytes
("multianalyte assay").

In this presentation I will focus primarily on the last 
of these objectives, using this to set out the principles 
underlying our present attempts to develop a new "min- 
iaturized" technology that will permit the simultaneous 
measurement of an unlimited number of analytes in a 
small biological sample such as a single drop of blood. 
However, retention (and, if possible, improvement) of 
the high sensitivities of conventional isotopic tech- 
niques is a basic aim not only of our own studies in this 
area but also of most other endeavors falling under the 
above headings. It is therefore appropriate to preface 
this paper with a discussion of the general principles 
underlying the attainment of high binding-assay sensi- 
tivity. 

Immunoassay Sensitivity: Some Basic Concepts 
Definition of Assay Sensitivity 

The need to establish assay conditions yielding maximal sensitivity
underlay the independent construction of mathematical theories of
immunoassay design by both Yalow and Berson (1) and Ekins et al. (2) in
the course of the original development of these methods in the early
1960s. Regrettably, these theoretical studies led to a prolonged
controversy, arising largely from the conflicting concepts of
"sensitivity" adopted by the two groups (see Figure 1). Briefly, Berson
and Yalow, in their many publications relating to immunoassay design
(e.g., 1, 3), defined sensitivity as the slope of the







Fig. 1. The differing concepts of sensitivity and precision underlying
radioimmunoassay design theories developed by (left) Yalow and
Berson (e.g., 1, 3) and (right) Ekins et al. (2, 4)
Yalow and Berson define assay A as more sensitive because it yields a
response curve of greater slope. Ekins et al. define assay B as more sensitive
because the imprecision of measurement of zero dose (σ_0) is less. Yalow and
Berson likewise define an assay system as more precise if it yields a steeper
response curve when data are plotted on a log dose scale

response curve relating the fraction or percentage of
labeled antigen bound (b) to analyte concentration ([H]).
In contrast, Ekins et al. (e.g., 2, 4) defined sensitivity as
the imprecision of measurement of zero dose, this
quantity being indicative of, and essentially equivalent
to, the lower limit of detection.

The key difference between these two definitions 
clearly lies in the dependence of the assay detection 
limit on the error (imprecision) in the measurement of 
the response variable. By neglecting this crucial factor, 
the "response curve slope" definition leads to many 
obvious absurdities. For example, plotting conventional 
RIA data in terms of the response metameter B/F (i.e., 
the bound to free ratio) suggests that assay "sensitivity" 
is increased by increasing the antibody concentration in 
the system; however, the converse conclusion is reached 
if identical data are plotted in terms of F/B (see Figure 
2). Observation of the shape and slopes of response
curves without detailed error analysis thus constitutes a 
totally misleading guide to optimal immunoassay de- 
sign. This approach has, however, characterized many 
of the studies conducted in the immunoassay field dur- 
ing the past 30 years, and has been the source of much 




[Figure 2 panel titles: Response curve slope; Detection limit]

Fig. 2. Schematic representation of RIA dose-response curves
observed for high and low antibody concentrations plotted in terms of
(left) the free/bound fraction (F/B); (center) the bound/free fraction
(B/F)
Note that the low antibody concentration yields a response curve of greater
slope when the assay response is plotted in terms of F/B, but of lower slope
when plotted in terms of B/F. The precision of measurement of zero dose
(ΔD_0) is independent of the coordinate frame used to plot assay data (see
right)






mythology. For example, consideration of the Law of Mass Action reveals
that, when response curves corresponding to different antibody
concentrations are plotted in terms of b vs [H], the maximal slope at
zero dose is obtained for a concentration of 0.5/K (where K is the
affinity constant), in which circumstance the zero dose response (b0)
is 33%. This conclusion led to Berson and Yalow's enunciation of the
well-known dictum (which, albeit erroneous, is broadly adhered to by
many immunoassay practitioners and kit manufacturers) that, to maximize
RIA sensitivity, the amount of antibody to use in the system is that
which binds 33% of labeled antigen in the absence of unlabeled antigen
(1, 3).
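
The arithmetic behind the "33%" figure can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes a single class of binding sites, treats b as the bound fraction of total antigen (labeled and unlabeled antigen behaving identically), and works in units where K = 1. The function names and the scanning grid are illustrative assumptions. It locates only the steepest response curve; it says nothing about measurement error, which is the paper's objection to the slope criterion.

```python
import math


def bound_fraction(h, q, k=1.0):
    """Bound fraction b of total antigen h for antibody site concentration q.

    Mass action gives b/(1-b) = k*(q - b*h), i.e. the quadratic
    k*h*b^2 - (1 + k*q + k*h)*b + k*q = 0; the smaller root is physical.
    The rationalized form below is also valid at h = 0.
    """
    s = 1.0 + k * q + k * h
    disc = s * s - 4.0 * k * k * q * h
    return 2.0 * k * q / (s + math.sqrt(disc))


def zero_dose_slope(q, k=1.0):
    """Finite-difference estimate of db/d[H] at zero dose."""
    dh = 1e-6 / k
    return (bound_fraction(dh, q, k) - bound_fraction(0.0, q, k)) / dh


def steepest_antibody_concentration(k=1.0, steps=500, step=0.01):
    """Scan q from 0.01/K to 5/K and return the q with the steepest slope."""
    candidates = [i * step / k for i in range(1, steps + 1)]
    return max(candidates, key=lambda q: abs(zero_dose_slope(q, k)))


best_q = steepest_antibody_concentration()
b0 = bound_fraction(0.0, best_q)
print(f"steepest slope at [Ab] = {best_q:.2f}/K, zero-dose binding b0 = {b0:.1%}")
```

Analytically, the zero-dose slope is -K·b0·(1 - b0)², which is largest in magnitude at b0 = 1/3, i.e. at [Ab] = 0.5/K.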

Disagreement regarding the concept of sensitivity 
inevitably led to prolonged dispute regarding immu- 
noassay design (5). However, although it is still common 
to encounter publications in the field that rely solely on 
the response curve slope as a measure of sensitivity, the 
assay detection limit is now widely accepted as the only 
valid indicator of this parameter, and we do not there- 
fore intend to dwell further on this issue here. It is 
nevertheless relevant to an understanding of the "min- 
iaturized" assay methodology described below to empha- 
size that untenable concepts of both sensitivity and 
precision underlie many of the commonly accepted rules 
governing current immunoassay-design practice, some 
of which are contravened in our own approach. 

Basic Immunoassay Designs 

It is likewise important in the present context to comprehend the basis
of the various types of immunoassays currently in use, and the
constraints on the sensitivities of which they are potentially capable.
The radioimmunoassay and analogous protein-binding assay techniques
originally developed for the measurement of insulin by Yalow and Berson
(6), and of thyroxin and vitamin B12 by Ekins and Barakat (7, 8), relied
on the use of a labeled analyte marker to reveal the products of the
binding reactions between analyte and binder (Figure 3, left). This
approach has subsequently often been portrayed as relying on
"competition" between labeled and unlabeled analyte molecules for a
limited number of protein-binding sites, such assays being frequently
referred to as "competitive."

Subsequently, Wide et al. in Sweden (9), followed shortly by Miles and
Hales in the U.K. (10), developed labeled antibody methods (Figure 3,
right). These methods represented an extension of the "labeled reagent"
methods (utilizing radiolabeled organic compounds such as ¹³¹I-labeled
p-iodosulfonyl chloride, [³H]acetic anhydride, and other similar
reagents) devised, during the early 1950s, by Keston et al. (11), Avivi
et al. (12), and others for quantifying amino acids, steroid and thyroid
hormones, etc. Although radiolabeled antibody methods
(immunoradiometric assays; IRMAs) were originally claimed (13) to be
more sensitive than methods based on the use of radiolabeled analyte,
these claims were supported by neither rigorous theoretical analysis nor
persuasive experimental evidence, and for some time remained
controversial. Further doubt on their validity





analyte + antibody* ⇌ analyte:antibody* (B)
                      + residual analyte
                      + residual antibody* (F)

Measure "fraction bound" (B); measure "fraction free" (F)

Fig. 3. Labeled-analyte (left) and labeled-antibody (right) assay
systems compared
[Remainder of caption illegible in the source scan.]



was cast by the publication by Rodbard and Weiss in 1973 (14) of
detailed theoretical studies demonstrating that both labeled analyte and
labeled antibody methods possessed essentially equal sensitivities.
(Note: These authors suggested that IRMAs might be more sensitive in the
assay of small polypeptides, in which radioiodine incorporation into the
antigen molecule was restricted; conversely, these assays would be less
sensitive for the measurement of antigens of high molecular mass.)
Nevertheless, despite the appearance of this publication, the belief
that labeled antibody methods per se are intrinsically more sensitive
than the corresponding labeled analyte methods gained wide acceptance
among clinical chemists.

The reason for confusion on this issue is that the greater potential
sensitivity of certain assay formats is not really a consequence of the
labeling of antibody as opposed to analyte; indeed, the apparent
antithesis between labeled-analyte and labeled-antibody methods diverts
attention from the true reasons underlying the superior sensitivity of
certain assay designs. Theoretical analysis (see, e.g., 4, 15) reveals
that, assuming "perfect" separation of the products of the binding
reaction (i.e., no misclassification of bound and free moieties), the
optimal antibody concentration (i.e., that yielding maximal sensitivity)
in a labeled analyte immunoassay invariably tends to zero, irrespective
of whether the free or bound labeled analyte fraction is measured,
whereas in labeled-antibody methods the optimal antibody concentration
depends on which labeled-antibody fraction is measured (see Figure 3).
If the free (unreacted) antibody fraction is measured, the optimal
concentration also tends to zero; conversely, if the analyte-bound
fraction is measured, the concentration tends to infinity. In short, of
the four basic measurement strategies available (labeled analyte, with
measurement of free or bound reaction product, and labeled antibody,
also with measurement of free or bound product), only one permits, in
practice, the use of antibody concentrations approaching infinity.



This particular approach may, for want of a better term, be described as
"noncompetitive," although it must be emphasized that such terminology
involves a departure from the original meanings attached to
"competitive" and "noncompetitive" when these descriptions were first
used in the present context. Indeed, as discussed below, assays may be
subclassed in this manner when no labeled reagent of any kind is
involved.

However, the categorization of immunoassays and other binding assays as
competitive or noncompetitive, depending on the binding agent
concentration yielding maximal assay sensitivity, itself obscures the
underlying reasons for the existence of this divergence in assay
designs, and may thus be misleading. These reasons may be more readily
understood if the basic principles of such assays are portrayed
differently from their customary presentation.

The "Antibody Occupancy Principle" of Immunoassay

When a "sensor" antibody is introduced into an analyte-containing
medium, binding sites on the antibody are occupied by analyte molecules
to a fractional extent that reflects both the equilibrium constant
governing the binding reaction, and the final concentration of free
analyte present in the mixture. This proposition stems immediately from
the Law of Mass Action, which can be written as

[AbAg]/[fAb] = K[fAg]    (1)

or as fractional occupancy of antibody binding sites given by

[AbAg]/[Ab] = K[fAg]/(1 + K[fAg])    (2)

where [AbAg], [Ab], [fAb], and [fAg] represent the concentrations (at
equilibrium) of bound and total antibody, and free antibody and antigen
(analyte), respectively, and K = equilibrium constant. The final
concentration of free analyte generally depends on the concentrations of
both total analyte and antibody; however, when total antibody
approximates 0.05/K or less, free and total antigen ([Ag])
concentrations do not differ significantly, and fractional occupancy of
antibody is given by

[AbAg]/[Ab] = K[Ag]/(1 + K[Ag])    (3)

Assays utilizing this concept have been termed "ambient analyte
immunoassay" (16), fractional occupancy being independent of both sample
volume and antibody concentration (see below).
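
The step from equation 2 to equation 3 can be illustrated with a short calculation. This is a sketch under stated assumptions rather than the authors' computation: the affinity constant K = 10¹⁰ L/mol and the analyte concentrations are arbitrary illustrative values, and the function names are hypothetical. It solves the mass-action equilibrium exactly and compares the resulting occupancy with the "ambient analyte" approximation at a total antibody concentration of 0.05/K.

```python
import math


def equilibrium(total_ab, total_ag, k):
    """Return (bound complex, free antigen), all in mol/L.

    From [AbAg] = K*[fAb]*[fAg] with [fAb] = [Ab] - [AbAg] and
    [fAg] = [Ag] - [AbAg]; the smaller quadratic root is physical.
    """
    s = total_ab + total_ag + 1.0 / k
    disc = s * s - 4.0 * total_ab * total_ag
    bound = 2.0 * total_ab * total_ag / (s + math.sqrt(disc))
    return bound, total_ag - bound


def occupancy_exact(total_ab, total_ag, k):
    """Equation 2: occupancy computed from the true free antigen level."""
    bound, _ = equilibrium(total_ab, total_ag, k)
    return bound / total_ab


def occupancy_ambient(total_ag, k):
    """Equation 3: occupancy assuming free antigen equals total antigen."""
    return k * total_ag / (1.0 + k * total_ag)


K = 1e10             # L/mol, assumed for illustration only
AB = 0.05 / K        # total antibody binding sites, mol/L

for multiple in (0.1, 1.0, 10.0):
    ag = multiple / K
    exact = occupancy_exact(AB, ag, K)
    approx = occupancy_ambient(ag, K)
    print(f"[Ag] = {multiple:>4}/K  exact = {exact:.4f}  "
          f"eq. 3 = {approx:.4f}  difference = {(approx - exact) / approx:.2%}")
```

Under these assumptions the fraction of analyte removed from solution can never exceed K[Ab]/(1 + K[Ab]), about 4.8% at [Ab] = 0.05/K, so equation 3 stays within a few percent of the exact occupancy.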

All immunoassays essentially depend on measurement of the "fractional
occupancy" of the sensor antibody after its reaction with analyte (see
Figure 4).

Techniques relying on the measurement of unoccupied antibody binding
sites (from which antibody occupancy is implicitly deduced by
subtraction) necessitate, for attainment of maximal sensitivity, the use
of sensor antibody concentrations tending to zero; these assays





[Figure 4 label: Fractional occupancy of antibody]

Fig. 4. The antibody binding-site occupancy principle of immunoassay

may therefore be categorized as "competitive." Conversely, techniques in
which occupied sites are directly measured permit (in principle) the use
of relatively high concentrations of sensor antibody and may be
described as "noncompetitive." This difference in assay design simply
reflects the proposition that, to minimize error in the measurement, it
is generally undesirable to measure a small quantity by estimating the
difference between two large quantities.
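
The penalty for estimating a small quantity by difference can be made concrete with a simple error-propagation sketch. The assumptions here are illustrative and not from the paper: the total number of binding sites is taken as known exactly, each measured signal carries the same relative error (CV), and the function names are hypothetical.

```python
def relative_error_direct(cv):
    """Occupied sites measured directly: the relative error is just the CV."""
    return cv


def relative_error_by_difference(occupied_fraction, cv):
    """Occupied sites deduced as (total - unoccupied).

    With the total assumed known exactly, the absolute error is that of the
    unoccupied measurement, cv * (1 - occupied_fraction), and it is then
    expressed relative to the small occupied fraction.
    """
    unoccupied_fraction = 1.0 - occupied_fraction
    return cv * unoccupied_fraction / occupied_fraction


CV = 0.01  # assumed 1% relative error on any single measurement
for occupied in (0.5, 0.1, 0.01):
    direct = relative_error_direct(CV)
    indirect = relative_error_by_difference(occupied, CV)
    print(f"occupancy {occupied:>5.0%}: direct {direct:.1%}, by difference {indirect:.1%}")
```

At 1% occupancy the subtraction route turns a 1% measurement error into an error of about 99% in the estimated occupancy, which is why such designs favor conditions in which the unoccupied fraction is itself small relative to the change being measured.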

These concepts are illustrated in Figure 5, which portrays basic
immunoassay formats currently in common use. Conventional RIA and other
similar "labeled-analyte" techniques rely on measurement of unoccupied
binding sites, generally by back-titration (either simultaneous or
sequential) with labeled analyte, but anti-idiotypic antibody (reactive
only with unoccupied sites on the sensor antibody) may be used for the
same purpose. In the case of single-site labeled-antibody assays, the
labeled antibody itself constitutes the sensor antibody; after reaction
with analyte, this sensor antibody may be separated into occupied and
unoccupied fractions through use of (e.g.) an immunosorbent ([...]
analog, or anti-idiotypic antibody [...]). If, after separation, the
signal emitted by labeled antibody bound to analyte (i.e., the
"occupied" fraction) is measured directly, the assay can be classed as
"noncompetitive." Conversely, if one measures the labeled antibody not
bound to analyte (i.e., that attached to the immunosorbent), then the
assay is "competitive."
Two-site "sandwich" assays are clearly more complex because they rely on
two antibodies and can be considered from two points of view. For our
present purposes the solid-phase antibody can be regarded as the
"sensor" antibody, with the labeled antibody enabling the occupied
sensor-antibody binding sites to be distinguished. Seen from this
viewpoint, two-site assays may be classed as "noncompetitive."

These considerations emphasize that the differences in design
distinguishing so-called competitive and noncompetitive methods are
essentially unrelated to which component (if any) of the reaction system
is labeled. Indeed, in the case of transducer-based "immunosensors," no
component is labeled; nevertheless, the design of the immunosensor will
differ significantly, depending on whether a measurable signal is
yielded by occupied or unoccupied antibody binding sites situated on a
surface. In short, the terms "competitive" and "noncompetitive" merely
reflect alternative approaches to the determination of the occupancy of
antibody binding sites and lead to differences in the optimal antibody
concentration required to minimize the effects of random errors arising
in the determination.

[Figure 5 panel titles: "NONCOMPETITIVE IMMUNOASSAY"; "COMPETITIVE
IMMUNOASSAY"]

Fig. 5. Basic competitive and noncompetitive immunoassay designs
[Remainder of caption illegible in the source scan.]

Competitive and noncompetitive immunoassays can be shown to differ
significantly in many of their performance characteristics, including
their sensitivities. In general, both the affinity constant of the
antibody and the specific activity of the label are important in
determining sensitivity; however, in practice the sensitivity of
competitive assays is primarily limited by the affinity constant of the
antibody, whereas the specific activity of the label is more important
in noncompetitive systems. In both cases, the "experimental" or
"manipulation" error in the measurement of the zero-dose response (R0)
[i.e., the relative error (σ_R0/R0) arising from pipetting and other
operations, but not including the statistical signal measurement error per






se] is of key importance in determining "potential" assay sensitivity
(i.e., the sensitivity obtained by assuming the specific activity of the
label to be infinite, implying zero error in signal measurement). Thus
the potential sensitivity of a competitive assay can be shown to be
σ_R0/(K·R0), whereas that of a noncompetitive assay is given by
R0·σ_R0/([Ab]·K·R0), where, in the latter case, R0 is assumed to
represent the labeled antibody misclassified as bound ([bAb]0), commonly
referred to as "nonspecifically bound" antibody. Thus R0/[Ab] = f, the
fraction of labeled antibody that is nonspecifically bound, and
R0·σ_R0/([Ab]·K·R0) = f·σ_R0/(K·R0). Assuming that the relative error
(σ_R0/R0) in the measurement of the zero-dose response is approximately
identical for both competitive and noncompetitive assays, it is evident
from this simple analysis that the potential sensitivity of
noncompetitive methods is greater than that of competitive methods by
the factor f, i.e., by the fraction of labeled antibody that is
"nonspecifically bound." For example, if the nonspecifically bound
fraction is 0.01%, a noncompetitive strategy is potentially capable of a
sensitivity 10 000-fold greater than that of a competitive approach,
other factors being equal.
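
The comparison above reduces to two one-line formulas, sketched below. The numerical inputs (K = 10¹¹ L/mol, a 1% relative error) are assumptions chosen only to exercise the formulas, and the function names are hypothetical; only the ratio between the two results, which equals the nonspecific-binding fraction f, comes from the text.

```python
def competitive_potential_sensitivity(relative_error, k):
    """Potential detection limit (mol/L) of a competitive assay: (sigma_R0/R0)/K."""
    return relative_error / k


def noncompetitive_potential_sensitivity(relative_error, k, nonspecific_fraction):
    """Potential detection limit (mol/L) of a noncompetitive assay: f*(sigma_R0/R0)/K."""
    return nonspecific_fraction * relative_error / k


K = 1e11                 # L/mol, assumed for illustration
RELATIVE_ERROR = 0.01    # sigma_R0/R0, assumed equal for both designs
F = 0.0001               # 0.01% nonspecifically bound labeled antibody

competitive = competitive_potential_sensitivity(RELATIVE_ERROR, K)
noncompetitive = noncompetitive_potential_sensitivity(RELATIVE_ERROR, K, F)
print(f"competitive:    {competitive:.1e} mol/L")
print(f"noncompetitive: {noncompetitive:.1e} mol/L")
print(f"improvement:    {competitive / noncompetitive:,.0f}-fold")
```

A smaller number means a lower detection limit, so the noncompetitive design is the more sensitive one; the improvement factor is 1/f regardless of the values assumed for K and the relative error.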

[Figure 6 panel titles: Competitive; Noncompetitive. Vertical axis in
log molecules/mL, horizontal axis log K of the antibody. Marked assays:
IRMA, HS-ELISA, USERIA.]

Fig. 6. Theoretically predicted sensitivities of competitive and
noncompetitive immunoassay methods (represented by the SD of zero
analyte measurements, expressed as molecules/mL) plotted as a
function of antibody affinity (K)
Note: in noncompetitive sandwich assays, the antibody affinity referred
to is that of the labeled antibody. [Remainder of caption note illegible
in the source scan.]

These findings are summarized in Figure 6 (left), which shows the
relationships between sensitivity (expressed in terms of molecules per
milliliter) and antibody affinity in an optimized competitive (labeled
analyte) assay. For this analysis, we assume (a) the use of a label of
infinite specific activity, and (b) the use of ¹²⁵I as a label, the
radioactivity of the samples being counted for 1 min. Computations of
the theoretically optimal reagent concentrations (on which calculations
represented in Figure 6 rely) were based on the further assumptions that
(c) the radioactivity of the antibody-bound labeled-analyte fraction was
counted and (d) the (relative) "experimental error" component in the
measurement of the bound fraction (σ_b/b) was 1%. Given these
assumptions, the "potential" sensitivity attainable in such an assay is
σ_b/(K·b), where K is the affinity constant of the antibody. [For
example, if the affinity constant is 10¹² L/mol, and σ_b/b is 0.01 (1%),
maximal assay sensitivity is 10⁻¹⁴ mol/L, or ~6 × 10⁶ molecules/mL.] The
additional "signal measurement error" arising in consequence of counting
radioactive samples for a finite time implies a loss of assay
sensitivity, as shown by the upper curve in Figure 6 (left). However,
the resulting loss in sensitivity is relatively small for antibodies of
affinities <10¹² L/mol, and is negligible for antibodies with affinities
<10¹¹ L/mol. In other words, if the assayist can accept individual
sample counting times of 1-5 min, little improvement in sensitivity is
gained by using alternative labels of higher specific activities than
¹²⁵I. However, similar considerations suggest that radioisotopic labels
of much lower specific activity than ¹²⁵I (e.g., ³H) may limit the
sensitivities of the assays (such as steroid assays) in which they are
used, notwithstanding the use of relatively long sample counting times.
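
The bracketed worked example is a unit conversion that is easy to verify. The sketch below assumes only the relation quoted in the text (potential sensitivity equals the relative error divided by K) and the Avogadro constant; the function names and the range of affinities tabulated are illustrative choices.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def potential_sensitivity_mol_per_l(k_l_per_mol, relative_error):
    """Potential detection limit of a competitive assay, (sigma_b/b)/K, in mol/L."""
    return relative_error / k_l_per_mol


def mol_per_l_to_molecules_per_ml(conc_mol_per_l):
    """Convert a molar concentration to molecules per milliliter."""
    return conc_mol_per_l * AVOGADRO / 1000.0


# Tabulate the limit for a 1% relative error over a range of affinities.
for exponent in (9, 10, 11, 12):
    k = 10.0 ** exponent
    limit = potential_sensitivity_mol_per_l(k, 0.01)
    print(f"K = 1e{exponent} L/mol -> {limit:.1e} mol/L "
          f"= {mol_per_l_to_molecules_per_ml(limit):.1e} molecules/mL")
```

Because the limit is directly proportional to the relative error, tripling that error at fixed K triples the detection limit in this simplified relation.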

The other main conclusions stemming from such analysis are the
importance of both minimizing "manipulation" errors and using antibodies
of high binding affinity. For example, an increase in σ_b/b to 3%
implies an approximate threefold loss in sensitivity, notwithstanding
the fact that an assay reoptimized in response to the deterioration in
operator skill that these numbers imply would utilize less antibody and
labeled analyte, thereby partially offsetting the consequences of poor
pipetting. But the most important conclusion emerging from the analysis
is the near impossibility, in practice, of achieving immunoassay
sensitivities better than about 10⁷ molecules/mL by using a competitive
approach, irrespective of the nature of the label used, if one assumes
an upper limit to antibody binding affinities on the order of 10¹²
L/mol.

The results of a similar analysis of the sensitivity limitations
applying to noncompetitive (two-site) assays (15) are illustrated in
Figure 6 (right). Two sets of curves are portrayed here, corresponding
to the assumptions of 1% and 0.01% nonspecific binding of labeled
antibody to the capture-antibody substrate. Such analysis likewise
yields important conclusions relevant to assay design, e.g., the crucial
importance of reducing nonspecific binding of labeled antibody to an
absolute minimum. Furthermore, if nonspecific binding is reduced to
~0.01%, just as high sensitivity is attainable






by using an antibody of K = 10⁸ L/mol in an optimized noncompetitive
assay design as by using an antibody of K = 10¹² L/mol in a competitive
method. One of the most important conclusions is that the sensitivities
potentially attainable with high-affinity antibodies (K >10¹⁰ L/mol) are
beyond the reach of radioisotopically based methods, which (because of
the relatively low specific activities of isotopes such as ¹²⁵I) are
limited in practice to sensitivities of the order of 10⁸-10⁷
molecules/mL or more. In short, although, under certain circumstances,
noncompetitive IRMAs may be somewhat more sensitive than corresponding
RIA techniques (assuming the use of the same antibody in each
methodology), the potential advantages (in terms of sensitivity) [...]
are realized only if nonisotopic labels of much higher specific activity
than ¹²⁵I [...] are combined with high-affinity antibodies; however,
[...] even [...] use of antibodies with affinities of about [...] L/mol,
nonisotopic labels may yield a substantial improvement in sensitivity.

These theoretical conclusions, together with the publication by Köhler
and Milstein (18) of methods of in vitro production of monoclonal
antibodies (2), constituted the basis of my laboratory's collaborative
development (initiated around 1976) with the instrument manufacturer
LKB/Wallac of the time-resolved fluoroimmunoassay methodology now known
as DELFIA (19, 20). This methodology was the first "ultra-sensitive"
nonisotopic immunoassay methodology to be developed [...]; the approach
has subsequently been adopted by many other manufacturers, using a
variety of high-specific-activity labels (Table 1).

Against this background, let us now turn to the development of highly
sensitive, miniaturized "microspot" immunoassays and multianalyte assay
systems.

[...] Immunoassay: Concepts
Ambient Analyte Immunoassay

Particular attention has been drawn above to the specious notion that an
antibody concentration approximating 0.5/K is required to maximize the
sensitivity of conventional labeled-antigen assays. This proposition is
implicitly overturned by the development of microspot immunoassays,
which we expect to provide the basis of a new generation of binding
assay methods. But before discussing this methodology in detail, another
basic analytical concept must be examined.

The recognition that all immunoassay* 
re y on measurement of antibody occ^^ ^ 
potentially important type of assay, SnV^vt! 
mununoassay (J6). This name is intended ^ d ^bt 
assay sy^ that, unlike conventional metr**W 
sure theanalyte concentration in the medium to whS, 
™ » "posed, being independent bott i o "sam. 

pie volume and of the amount of itibj? J^Tt 

Law of Mass Action, which leads to the foUowinTeW 
boa, representing the fractional occupancy ^ bv^! 
lyte of antibody binding sites (at S * 

$$F^2 - F\{(1/[\mathrm{Ab}]) + ([\mathrm{An}]/[\mathrm{Ab}]) + 1\} + [\mathrm{An}]/[\mathrm{Ab}] = 0 \qquad (4)$$

where [An] = analyte concentration and [Ab] = antibody
concentration (both in units of 1/K).
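Equation 4 is a quadratic in F, so it can be solved directly. The following is a minimal sketch, not the authors' own procedure: the function name is hypothetical, and taking the smaller root (the only one lying between 0 and 1) is an inference from the algebra rather than something stated in the text.

```python
import math


def fractional_occupancy(an: float, ab: float) -> float:
    """Fractional occupancy F of antibody binding sites at equilibrium.

    Solves equation 4, F^2 - F*(1/ab + an/ab + 1) + an/ab = 0, for the
    root between 0 and 1. Both concentrations are in units of 1/K.
    """
    if ab <= 0.0:
        # Limiting case as the antibody concentration tends to zero.
        return an / (1.0 + an)
    b = 1.0 / ab + an / ab + 1.0
    c = an / ab
    disc = math.sqrt(b * b - 4.0 * c)
    # Equivalent to (b - disc) / 2, written to avoid cancellation error
    # when the antibody concentration is very small.
    return 2.0 * c / (b + disc)


# Occupancy at a fixed analyte concentration of 0.01/K as the antibody
# concentration is lowered; it levels off at [An]/(1 + [An]).
for ab in (1.0, 0.1, 0.01, 0.001):
    print(ab, fractional_occupancy(0.01, ab))
```

Lowering `ab` below about 0.01 changes the result only slightly, which is the "ambient analyte" behavior described in the text.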

From this equation it can readily be shown that, for
antibody concentrations approaching 0, F ≈ [An]/(1 +
[An]). This conclusion is illustrated in Figure 7, in
which the fractional occupancy of (e.g.,
"monoclonal") antibody binding sites in the presence of
various analyte concentrations is plotted against anti-
body concentration. When an antibody concentration of
less than (say) 0.01/K (the antibody preferably, but not
essentially, being coupled to a solid support) is exposed
to an analyte-containing medium, the resulting (frac-
tional) occupancy of antibody binding sites solely re-
flects the ambient concentration of analyte, and is
independent of the total amount of antibody in the
system.
system. (If, for example, if - 10" uZ a^Lti^ 
bmdmg-aite concentration of 0.01/JT representTo Olx 

^te in the medium but, because the amount boundfa 
small the resulting reduction in the ainbienTcoSL 
turn of analyte is insignificant For J^STT^ 
»ncen^tion of bmding sites of the ■aa£3Srffa£ 

<1%, and the system is therefore effectively indepen 




Table 1 (label column partly illegible in the scan)

Label                     Specific activity
[Radioisotopic label]     1 detectable event per second per
                          7.5 × 10⁶ labeled molecules
Enzyme label              Determined by enzyme "amplifica-
                          tion factor" and detectability of
                          reaction product
Chemiluminescent label    1 detectable event per labeled
                          molecule
Fluorescent label         [entry illegible]



1 E ?^* 8S , i0D of reagent concentrations in terma of Mfr v 
tie effect of generalizing the ^h^Z^^jf^}^ 
assay data. The terms [Abl and lAnl ? , of bmdia « 

that this convention hae^n7dhlrfc^ ll ? ed to indic «« 
They do not rater to mX^ce^Z^ l 6 ^^ *■ 
able with (AbJ and [AnJ. Fa" ^^^1 ^ afl ^ 
affinity (conatant) for anabtetfltfn U ^ Powessea an 

lO^'mol^trepreaented^^to of^A^Ti'/ C °° cen f ratioa <* 
Thus, fractional occup^cv^c^vl KofS 1 (dm f ^nleea) unit 

eal for all antibod^ffi w^f ST* ° D * <,u "l on 4 w Na- 
tion ia adcpl^rc^rTlaU^pT 881 ^^ 1 '^ "^n^- 
be identical for ay^ u^^^'^^^ticn wil] 

anUbody with an affinity rfj?» 1A, 0 UO^ m^^ ,t,0M ° f M 
with an affinity of 10'<> Ltoll iSSi rf ^ Utibod ^ 
affinity of 10» IVmol, t£Sidrf ft^i!? MUbo * r 40 
eij,rea«d in the same nwK """^tion i. 

The term "ambient" is used to indicate that antibody occu-
pancy reflects the analyte concentration to which antibody binding
sites are exposed, not the amount of analyte in the incubation;
i.e., the system is independent of sample volume.




Fig. 7. Fractional antibody binding-site occupancy (F; see equation
4) plotted as a function of antibody binding-site concentration
(x axis: antibody concentration) for different values of analyte
(antigen) concentration

All concentrations are expressed in units of 1/K. For antibody
concentrations <0.01/K (approximately), the percentage occupancy is
essentially unaffected by variations in antibody concentration extending over
several orders of magnitude. [Remainder of caption illegible.]

dent of sample volume, 

This conclusion leads to two further concepts. First,
the antibody may be confined to a "microspot" on a solid
support, provided that the total number of binding
sites within the microspot is <(v/K) × 10⁻⁵ × N, where
v = the sample volume to which the microspot is exposed
(in milliliters) and N = Avogadro's number (6 × 10²³).

For example, ifv = 1 and JT = i 0 » L/^,, ♦jj'g 








maximum number of binding sites that will cause n~ 

^^^^^ 
Jading sites u solely dependent on the ambient concer/ 
"ratiometnc,", microspot immunoassay. 

Dual-Label Microspot Immunoassay 
After exposure of a microspot of antibody to an
analyte-containing medium (Fig-
ure 8, left), the probe may be removed and exposed to a
solution containing a high concentration of a second
("developing") antibody directed against either a second
epitope (i.e., the occupied site) on the analyte molecule
if the molecule is large, or against unoccupied binding
sites on the antibody in the case of small analyte
molecules (Figure 8, right). The fractional occupancy of
the sensor antibody may thus be estimated by measuring
the ratio of sensor and developing antibody, [...]
labeling the sensor and the developing antibodies with
different labels, e.g., a pair of [...] or
chemiluminescent markers (or even labels of entirely
different nature). Fluorescent labels are par-
ticularly useful in this context because, by use of
optical scanning techniques (Figure 9), they permit the
scanning of arrays of antibody "microspots" distributed
over a surface (each microspot directed against a differ-
ent analyte), so that multiple analyte assays may be
performed simultaneously on the same sample.






Fig. 8. Microspot immunoassay: (left) first incubation [remainder
of caption illegible]. Panels: noncompetitive assay; competitive assay

Fig. 9. Basic principle of dual-label, ambient analyte immunoassay
relying on fluorescent-labeled antibodies

Panels show the sensor antibody with an anti-analyte developing antibody,
and the competitive system with an anti-idiotypic antibody. The signal
ratio reflects the value of F (see Fig. 7) and depends solely on the
analyte concentration to which the probe has been exposed. [Remainder of
caption illegible.]

Other advantages stem from adopting a dual fluorescence
measurement. For example, neither the amount nor the
distribution of the sensor antibody within the detector's
field of view is important, because the ratio of the
emitted fluorescent signals is unaffected. Likewise, fluc-
tuations in the intensity of the incident (exciting) light
beam are apt to be of little significance. These advan-
tages are additional to the basic benefit stemming from
this approach, i.e., that the necessity of ensuring con-
stancy of the amount of sensor antibody used in the
assay system is removed.
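The cancellation behind the ratiometric measurement can be illustrated with a toy model. This is only a sketch under stated assumptions: both signals are taken to be linear in excitation intensity and in the number of labeled molecules in view, every occupied site carries one developing antibody, and there is no background. The function name and the efficiency parameters are hypothetical.

```python
import math


def fluorescence_ratio(n_sites, occupancy, excitation,
                       eff_sensor=1.0, eff_developer=1.0):
    """Developing-antibody signal divided by sensor-antibody signal.

    Assumed linear model: each signal is proportional to the excitation
    intensity, a per-label detection efficiency, and the number of
    labeled molecules in the detector's field of view.
    """
    sensor_signal = excitation * eff_sensor * n_sites
    developer_signal = excitation * eff_developer * n_sites * occupancy
    return developer_signal / sensor_signal


# Same fractional occupancy, very different spot sizes and beam intensities.
small_spot = fluorescence_ratio(n_sites=1e5, occupancy=0.02, excitation=1.0)
large_bright_spot = fluorescence_ratio(n_sites=5e7, occupancy=0.02, excitation=3.7)
print(small_spot, large_bright_spot)
```

In this model the amount of sensor antibody and the beam intensity appear in both numerator and denominator, so only the occupancy (scaled by the ratio of the two label efficiencies) survives.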

Microspot Immunoassay Sensitivity 

Because the microspot immunoassay methodology 
challenges concepts that have dominated immunoassay 
design theory in the past two to three decades, consid- 
eration of the potential sensitivity attainable by this 
approach is obviously of primary importance. The prop- 
osition that microspot assays may be at least as sensi- 
tive as conventional systems that rely on far larger 
amounts of antibody may readily be demonstrated by 
consideration of a model system. Let us postulate that 
sensor antibody molecules are attached to the surface of 
a solid support such that their binding sites remain 
exposed to the analyte, and that their affinity for the 
analyte is thereby unchanged. (The antibody concentra-
tion in the system, i.e., the number of binding sites on the
support divided by the incubation volume, is unaffected
by such attachment, and antibody occupancy by analyte
at equilibrium will be identical to that occurring if the
antibody is distributed uniformly throughout the incu-
bation mixture.) Let us also suppose that the antibody
molecules exist as a uniform monolayer of maximal
surface density on the support and (to simplify discus-
sion) are unlabeled. Then a change in the concentration
of sensor antibody implies a corresponding change in
the surface area over which the antibody is distributed.
If, for example, the antibody affinity constant is 10¹¹
L/mol, the total incubation volume is 1 mL, and the
antibody surface density is 6000 binding sites/µm², then
an area of 10⁵ µm² (i.e., 0.1 mm²) accommodates
antibody binding sites corresponding to a concentration
of 0.1/K; an area of 0.01 mm² corresponds to a concen-
tration of 0.01/K, etc. Let us further suppose that, after
exposure of the sensor antibodies to a medium contain-
ing analyte at a concentration of 0.01/K (i.e., 6 × 10⁷
molecules/mL), we measure "noncompetitively" the re-
sulting occupancy of the sensor antibody (e.g., by exposure
to a second, labeled, "developing" antibody directed against the
analyte, forming a typical antibody sandwich). Finally,
let us suppose that all occupied sites react with the
developing antibody, with the latter also binding non-
specifically at a density of 1 molecule/µm² of surface.

Let us now consider the effect of a progressive
reduction of the antibody-coated surface area from (e.g.)
1 mm² (antibody concentration 1/K) through
0.1 mm² (0.1/K) to 0.01 mm² (0.01/K) and below. From
equation 4, the value of F for the 1 mm² area is 4.98 ×
10⁻³; it follows that the number of analyte and
labeled antibody molecules specifically bound to the
area is 2.99 × 10⁷ (i.e., about 50% of the total analyte
molecules present), whereas the number of labeled an-
tibody molecules nonspecifically bound is 10⁶. Thus,
assuming the field of view of the detecting instrument is
restricted to the area on which the sensor antibody is
deposited (see Figure 10a), and (provisionally) assuming
the background (or "noise") of the instrument itself to be
zero (i.e., the only source of background is the non-
specifically-bound labeled antibody within the instru-
ment's field of view), the signal/noise ratio observed for
the 1 mm² area is ~30. Similarly, the value of F for a 0.1
mm² area is 9.02 × 10⁻³, the number of labeled anti-
body molecules specifically bound to the area is 5.41 ×
10⁶, the number nonspecifically bound is 10⁵, and the
signal/noise ratio is ~54. Likewise, the signal/noise
ratio for a 0.01 mm² area can be shown to be ~59. In
short, the signal/noise ratio increases as the antibody-
coated surface area is decreased, approaching a maxi-
mal (plateau) value of 60 as the area coated with sensor
antibody falls below 0.01 mm² and tends toward zero.

Fig. 10. [Caption largely illegible.] Panels: (a) field of view decreases
with area of antibody deposition; (b) field of view constant, area of
antibody deposition decreases: S/B falls; (c) field of view constant,
density of antibody deposition decreases: S/B falls
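The signal/noise figures quoted for the model system can be reproduced numerically. The sketch below is illustrative only: the constant and function names are hypothetical, Avogadro's number is rounded to 6 × 10²³ as in the text, and the nonspecific binding density of 1 molecule/µm² is inferred from the quoted count of 10⁶ nonspecifically bound molecules on 1 mm², not taken from a legible statement.

```python
import math

AVOGADRO = 6.0e23           # rounded, as in the text
K = 1e11                    # sensor antibody affinity, L/mol
VOLUME_ML = 1.0             # incubation volume, mL
SITE_DENSITY = 6000.0       # sensor binding sites per um^2
NONSPECIFIC_DENSITY = 1.0   # assumed: molecules per um^2 (inferred, see lead-in)
ANALYTE = 0.01              # analyte concentration in units of 1/K


def occupancy(an, ab):
    """Root of equation 4 lying between 0 and 1 (concentrations in units of 1/K)."""
    b = 1.0 / ab + an / ab + 1.0
    c = an / ab
    return 2.0 * c / (b + math.sqrt(b * b - 4.0 * c))


def signal_to_noise(area_mm2):
    """Specifically bound / nonspecifically bound developing antibody on the spot."""
    area_um2 = area_mm2 * 1e6
    sites = SITE_DENSITY * area_um2
    # Number of binding sites in the incubation volume that corresponds
    # to a concentration of exactly 1/K.
    sites_per_unit = AVOGADRO / K * VOLUME_ML * 1e-3
    ab = sites / sites_per_unit
    specific = occupancy(ANALYTE, ab) * sites
    nonspecific = NONSPECIFIC_DENSITY * area_um2
    return specific / nonspecific


for area in (1.0, 0.1, 0.01, 0.001):
    print(area, signal_to_noise(area))
```

With the detector's field of view matched to the spot, the ratio reduces to F × 6000 / 1, which is why it rises toward a plateau as F approaches [An]/(1 + [An]).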

If, however, a reduction in the antibody-coated area
were not accompanied by a corresponding reduction in
the detecting instrument's field of view, the resulting
reduction in "signal" would not lead to a corresponding
decrease in the background generated by nonspecifi-
cally-bound developing antibody (Figure 10b). There-
fore, although reduction in the coated area would in-
crease the fractional occupancy of the sensor antibody,
the signal/noise ratio might either remain constant or
fall. In these circumstances it might be advantageous to
increase the coated area. Similarly, if the surface den-
sity of sensor antibody were decreased (the coated area
being held constant), similar conclusions would be
reached (Figure 10c).

Likewise, if the background signal generated within
the detecting instrument itself (e.g., from the photocath-
ode of a photomultiplier tube used to detect photons
emitted from the antibody-coated area) were not zero,
and remained constant regardless of the instrument's
field of view, then a maximum signal/noise ratio would
also be attained at some optimal value of the antibody-
coated area, below which the ratio would fall. Because,
however, one can generally reduce the size of the detector
(and hence the detector-generated background) at the
same rate as the size of the signal-emitting area, there is
no reason, in principle, for the signal/noise ratio to
diminish as the antibody-coated area is progressively
reduced toward zero. Thus, if we accept the signal/noise
ratio as indicative of the precision of the measurement of
antibody occupancy (and hence of assay sensitivity),
these considerations suggest that it is advantageous to
reduce the antibody-coated surface area (and, concomi-
tantly, the sensor-antibody concentration) toward zero,
although little advantage is likely to accrue from reduc-
ing the area below 0.01 mm² (and thus the antibody
concentration below 0.01/K).

Were the microspot area indeed reduced to zero, both 
signal and noise would likewise also fall to zero (the 
ratio between them nevertheless remaining essentially 
constant), implying that no signal of any kind would, in 
the limit, be recorded. In practice, other statistical 
factors come into play when the number of individual
events (e.g., photons) observed by a detecting instru- 
ment is very low, thus prohibiting a reduction of the 
sensor antibody concentration to zero. The point at 
which the reduction in the antibody-coated area causes 
th» rUt^HV fli^nl to be lost sufficiently to affect the 



precision of the measurement of antibody occupancy 
depends clearly on the specific activity of the labeled 
antibody used to measure the occupied binding sites: the
higher the specific activity, the smaller the permissible
area. Thus, given labels of very high specific activity,
one can envision circumstances in which, even in a
"noncompetitive" system, the optimal concentration of
sensor antibody may be exceedingly low. A more gen- 
eral conclusion is that a variety of factors, including the 
characteristics of the instruments used for measuring 
the labeled antibody (or labeled analyte), influence 
immunoassay design, implying, among other things, the 
virtual impossibility of formulating general rules re- 
garding this. For example, reagent concentrations that 
are optimal for isotopically labeled reagents used with a 
conventional radioisotope counter (possessing a fixed 
background dependent on its basic construction) are 
likely to be entirely different when very high-specific- 
activity labels are used and one has the freedom to tailor 
the measuring instrument to samples of any size. In 
short, certain conclusions based on experience of RIA 
and IRMA techniques may prove misleading when ap- 
plied to nonisotopic methodologies, and should be 
viewed with caution. 

A more detailed theoretical consideration of (noncom-
petitive) microspot immunoassay sensitivity (21) sug-
gests that

$$[\mathrm{An}]_{\min} = D^{*}_{\min} \times [(6 \times 10^{20}/K)(1 + [\mathrm{Ab}^{*}])]/(D \times [\mathrm{Ab}^{*}]) \qquad (5)$$

where D = surface density (binding sites/µm²) of sensor
antibody, K = sensor antibody affinity (L/mol), [Ab*] =
concentration of labeled antibody in developing solution
(expressed in units of 1/K*, where K* = labeled antibody
affinity), D*min = minimum detectable surface density
of labeled antibody (molecules/µm²), and [An]min = assay
detection limit (molecules/mL). For example, if [Ab*] =
1, D = 10⁵ molecules/µm², K = 10¹¹ L/mol, and D*min =
20 molecules/µm², then [An]min = 2.4 × 10⁶ molecules/mL
= 4 × 10⁻¹⁵ mol/L and the fractional occupancy of the
binding sites of the sensor antibody by the minimum
detectable concentration of analyte is 0.04%. Figure 11
shows the theoretical assay sensitivities attainable with
use of sensor antibodies of various affinities, plotted as a
function of D*min.
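The worked example for equation 5 can be checked with a few lines of code. This is a sketch under assumptions: the form of equation 5 used here is the one that reproduces the quoted figures (2.4 × 10⁶ molecules/mL, 4 × 10⁻¹⁵ mol/L, 0.04%), with D = 10⁵ molecules/µm² and K = 10¹¹ L/mol; the function and parameter names are hypothetical.

```python
import math


def detection_limit(d_min, density, k, ab_star=1.0):
    """Assay detection limit in analyte molecules/mL (equation 5 as read here).

    d_min   : minimum detectable labeled-antibody density, molecules/um^2
    density : sensor antibody surface density, binding sites/um^2
    k       : sensor antibody affinity, L/mol
    ab_star : developing antibody concentration in units of 1/K*
    """
    one_over_k = 6e20 / k  # molecules/mL equivalent to a concentration of 1/K
    return d_min * one_over_k * (1.0 + ab_star) / (density * ab_star)


def min_occupancy(d_min, density, ab_star=1.0):
    """Fractional sensor occupancy at the detection limit."""
    return d_min * (1.0 + ab_star) / (density * ab_star)


limit = detection_limit(d_min=20, density=1e5, k=1e11, ab_star=1.0)
limit_mol_per_l = limit * 1e3 / 6e23  # molecules/mL -> mol/L
print(limit, limit_mol_per_l, min_occupancy(20, 1e5))
```

Because the limit scales as 1/K and as D*min/D, a tenfold gain in either sensor affinity or detector sensitivity lowers the detection limit tenfold in this model.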

A similar theoretical analysis of competitive micro- 
spot immunoassay indicates that potential sensitivities 
are essentially identical to those attainable with con- 
ventional competitive methodologies. In summary the 
above considerations indicate that the attainment of 
high microspot assay sensitivity requires close packing 
of molecules of sensor antibodies within the microspot 
area, combined with the use of an instrument capable of 

accurately measuring very low surface densities of de-
veloping antibodies. They also suggest that (a) micro-
spot assay sensitivities considerably higher than those
obtainable by conventional isotopically based immu-
noassays are achievable, and (b) if labels of very high
specific activity are available, the sensitivities yielded



Fig. 11. Theoretically predicted sensitivities of noncompetitive micro-
spot immunoassay, plotted as a function of the minimum developing
antibody density detectable with the measuring instrument (y axis:
sensitivity, molecules/mL, log scale)

[Remainder of caption, giving the postulated values of capture antibody
surface density and the other assay parameters, is illegible.]

rtnanento used) a*PK£? * measuri ** 
achievable fa ****** 
sign. P 73 of conventional de- 

Anally, we briefly address a fitr#k«. 
sionallv raised in this conSi i e < Ue ? on v «*a- 
tenstics of microspot assays^* ° cW - 

regarding this iTe. ff^SSr ^ * 
sensing antibody, the lower the dS*. nucrospot of 
the velocity of the ■rtSSJlSS^ C0D8trainta « 
that at the limit (i.e «^ 80 

eituated within the mkr3 Jrt ani0Unt ° f »«body 
kinetics of the ^nTp^^^' ^ 
homogeneous liquid-Dhase c„ , observed in a 

bon medium is exceedingly lowX^fJ? tb * lncub «- 
which sensor u<aEj3^& w£?°^ rate at 
spot become o^piedl SSSr^ ^^Jf** 
cumstance than when a relativ.lv fc.fk^ ^ ar ' 
antibody is used, as in , of 
those of noncompetitive design wE? 
in mind the relationship Seen ^ bearin « 

of sensor antibody and ^e^^^^^ 
above, it is readily demoi^^Hi? Tl ° dwcufl sed 

the ratio rises is iSS^S?-*" nte at whi <* 
the antibody J^J^^^^T ^ 
instrumentation whose field of view iZ^SSFT 
-crospot area, the highest ^no^t^ut 

observed (after any selected incubXn TriXhTn ^ 

concentration of sensor antibodv in ' hCnth ' 
<0.01/ff. In v short, contrary p^L t !^ 3tem is 
pression, and to the i^S^J^^ 
immunoassay incubation times require toe u£ «f 
We amounts of antibody, the Syti^t^ 




proach provides the basis of a**™ ~* . . 

than any curren^avdS "»« 

^ t""^ Con siderations 

Although various hidi-snedfir 
beb are potentially usableSs ^S 
nary studies have 7 S * Te T^' Prelhni - 
fluoroTphors. The simultaneous m^, ronventi ™»l 
fluorescences from sinall^is^^^^ ,? f du ^ 
ushed, and the availabih^S^S^ 
t«a (e.g., the laser scanning ctmS^^T^** 
not specifically designed I Zfr^^L^ 
been useful in demonstrating £^SL^STi S - 
microspot approach, ieasibdity of the 

In laser scanning confocal fluorescence mirr^ 
J"*""" ^ the specimen is fltafiSJb???^ 
^ser^am, the fluorescence phoSHmftted 
area bemg focused in turn on to a deScTtJ^ 
low^iark-current photomultiplier CaTS? 'a^?^ 7 a 
focal" point, the projection «f *k oT' ^ At "n* 
-d ^ack-proK*^^^^^ *** 
(Figure 12). Fluorescence pho^enSd k f"? * 
points thus possess a low nrnk«K?f-^ em i tted at other 
detector. SaS^L^S^ ° f reachin 8 ^ 

ated in a ddhXCfrfl ^ U0Te8Cent Otters situ- 
spontaneously emiS bf "T^, E1 «<«™ 

cathode contribute to £. ? a e . photom ^tiplier photo- 

nistrumente permits the photocaSode^t^ ^ ° f Mch 
- area, and this source of ba^ut^^ 




Fig. 12. Schematic diagram of [caption illegible; legible labels:
objective, antibody microspot]



to diminish with future improvement in photomultiplier
design. Other sources of background include fluores-
cence emitted by components in the optical system,
which may not, in current instruments, have been
constructed with background reduction as a prime con-
sideration. Nevertheless, such instruments detect fluo-
rescent signals with high sensitivity. For example, one commercially
available microscope is claimed to detect fluorescein at a
density of 10 molecules/µm². Most commercially avail-
able fluorescein isothiocyanate (FITC)-labeled IgG ex-
hibits a fluorophor/protein ratio of ~4; this implies a
detection limit (D*min) for antibody surface density of
two or three FITC-labeled IgG molecules per microme-
ter². This, in turn, implies a theoretical sensitivity for a
two-site immunoassay of ~2-3 × 10⁵ analyte molecules
per milliliter, assuming parameter values identical to
those above, or 2-3 × 10⁴ molecules/mL if the sensing anti-
body has an affinity of 10¹² L/mol. Clearly, sensitivity
may be increased by loading more fluorophor either
directly or indirectly onto the antibody.
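As a rough check of the figures in this paragraph, the arithmetic can be written out as follows. This is a sketch only: it assumes that equation 5 has the form implied by its worked example, and that "identical parameter values" means [Ab*] = 1 and D = 10⁵ molecules/µm²; the symbol [An]min for the detection limit is a label chosen here.

```latex
\begin{align*}
D^{*}_{\min} &\approx \frac{10\ \text{fluorescein molecules}/\mu\text{m}^{2}}
                          {4\ \text{fluorophors per IgG}}
              = 2.5\ \text{IgG molecules}/\mu\text{m}^{2} \\[4pt]
[\mathrm{An}]_{\min} &\approx D^{*}_{\min}\,
      \frac{(6\times10^{20}/K)\,(1+[\mathrm{Ab}^{*}])}{D\,[\mathrm{Ab}^{*}]}
   = 2.5 \times \frac{(6\times10^{9})(2)}{10^{5}}
   = 3\times10^{5}\ \text{molecules/mL}
   \quad (K = 10^{11}\ \text{L/mol}) \\[4pt]
[\mathrm{An}]_{\min} &\approx 2.5 \times \frac{(6\times10^{8})(2)}{10^{5}}
   = 3\times10^{4}\ \text{molecules/mL}
   \quad (K = 10^{12}\ \text{L/mol})
\end{align*}
```

Taking D*min as 2 or 3 molecules/µm² instead of 2.5 gives 2.4 × 10⁵ and 3.6 × 10⁵ molecules/mL, consistent with the quoted range.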

Our preliminary studies have relied on a less sensi-
tive microscope, albeit one possessing facilities for dual-
fluorescence measurement. Its argon laser emits two
excitation lines at 488 and 514 nm. It is thus particu-
larly efficient in exciting blue/green-emitting fluoro-
phores such as FITC (excitation maximum 492 nm), but
is less efficient in exciting fluorophores such as Texas
Red (excitation maximum 596 nm). However, the ratio-
metric assay principle permits considerable variation in
detection efficiencies of the two labels because the spe-
cific activities of the labeled antibody species forming
the antibody couplets can be chosen to yield signal
ratios approximating unity. Inefficiency of the argon
laser in exciting Texas Red is thus not a major handicap
in this context. Though this instrument relies on a
conventional microscope and not on an optical system
designed for this purpose (and thus implicitly less sen-
sitive), it permits quantification of fluorescence signals
generated from microspots of any selected area. Initial
studies have revealed that, under conditions that are
not optimal, the instrument is capable of detecting ~25
FITC-labeled and (or) 150 Texas Red-labeled IgG mole-
cules per micrometer², while scanning an area of ~50
µm².

The development of microspot immunoassays has also
necessitated closer scrutiny of the mechanisms involved
in the coupling of antibodies to solid supports. In the
present context, these should display a capacity to
adsorb (in the form of a monolayer), or to covalently
link, a high surface density of antibody combined with
low intrinsic-signal-generating properties (e.g., low in-
trinsic fluorescence), thus minimizing background. We
have examined a number of candidate materials, such
as polypropylene, Teflon, cellulose and nitrocellulose
membranes, microtiter plates (clear polystyrene plates;
black, white, and clear polystyrene plates), glass slides
and quartz optical fibers coated with 3-(aminopropyl)-
triethoxysilane, etc., and several alternative protocols
for achieving high monolayer coating densities. These
studies have exposed phenomena neither evident nor of
importance when antibody binding to solid supports is
examined at a macroscopic level. Provisionally, we have
used white Dynatech Microfluor microtiter plates,
formulated for the detection of low fluorescence signals
and yielding high signal/noise ratios and high coating
densities of functional antibodies (~5 × 10⁴ IgG mole-
cules/µm²), for assay development, although such
plates are not ideal. Indeed, deficiencies in the antibody-
deposition methods used constitute the principal source
of imprecision in assay results and the limitation in
sensitivity that this implies. Clearly, this represents an
area for further study and refinement of current coating
techniques.

Notwithstanding the limitations of present instru- 
mentation (which, among other things, does not permit 
the use of time-resolving techniques to distinguish two 
individual fluorescence signals either from each other or
from background fluorescence) and the crudeness of
present methods for coupling antibodies onto small 
areas, we have verified the theoretical concepts outlined 
above by comparing the performance of several assays 
when constructed in microspot format and when conven- 
tionally designed. Although unoptimized, ratiometric 
microspot assays have yielded sensitivity values closely 
approaching those of conventional optimized IRMA. As 
an example, the results of a ratiometric assay system for 
thyrotropin, with use of Texas Red- and FITC-labeled 
antibodies, are shown in Figure 13. Bearing in mind the 
well-known limitations of these and other "convention- 
al" fluorophors when used as immunoassay reagent 
labels, such results are encouraging, although further 
work is clearly required to achieve the considerably 
greater sensitivity theoretically predicted with use of 
improved fluorophors, better antibody-microspotting 
techniques, and purpose-built (time-resolving) instru- 
mentation. 

The finding that highly sensitive immunoassays can 
be performed with far smaller amounts of antibody than 



Fig. 13. Response curve in a dual-labeled microspot ratiometric
assay of thyrotropin (TSH) with Texas Red-labeled solid-phase
capture antibody and a developing antibody labeled with biotin/
FITC-avidin

The FITC/Texas Red ratio for each microspot was measured with a scanning
confocal microscope, and plotted as a function of TSH concentration in
milli-int. units/L (x axis: TSH concentration, 1-1000 mU/L, log scale).
Two curves are shown, for solid-phase Ab coated under two conditions
[legend illegible].



are currently used conventionally permits in turn the 
construction of antibody microspot arrays enabling, in 
principle, the simultaneous measurement of thousands 
of different substances in 1-mL samples. In collabora- 
tion with investigators at the Centre for Applied Micro- 
biological Research, Porton Down, U.K., we are pres- 
ently developing various techniques for the creation of 
such arrays. Indeed, similar technologies have recently 
been used for the parallel synthesis of several different 
polypeptides, these enabling 10 000-microspot arrays to 
be constructed on silica chips approximating 1 cm 2 (24). 
Although arrays of this capacity are unlikely to ever be
required for conventional diagnostic purposes, we can 
anticipate that the ability to simultaneously measure 
many substances in the same sample will have revolu- 
tionary consequences in medicine and other similar 
areas. In addition, such techniques may ultimately 
permit the individual analysis of the multiple isoforms 
of certain "heterogeneous" analytes (e.g., the glycopro- 
tein hormones), such molecular heterogeneity currently 
presenting a major obstacle to the standardization and 
interpretation of many immunological measurements 
(25). Moreover, although these concepts have been illus- 
trated in an immunoassay context, they are clearly 
applicable to all "binding assays," including those rely-
ing on the use of DNA probes, hormone receptors, etc.
For example, labeled lectins that are specific in their 
reactions with the sugar residues in the oligosaccharide 
chains of glycoprotein molecules may be used, together 
with specific antibodies, to impart additional "structural 
specificity" to sandwich assays (26, 27), possibly over- 
coming the limitations of antibodies per se in regard to 
differentiation of the glycosylation variants of the gly- 
coprotein hormones. 

Summary and Conclusion

Because of past confusion regarding the concepts of 
precision, sensitivity, accuracy, etc., several erroneous 
concepts have become incorporated within currently 
accepted rules of immunoassay design. In particular, 
much higher antibody concentrations are customarily 
used than are necessary to achieve very high assay 
sensitivity, provided that certain measurement strate- 
gies are adhered to. In this presentation, we have 
attempted to show that, in principle, the highest assay 
sensitivities are obtained by confining a small number 
of sensor antibody molecules onto a very small area in 
the form of a microspot and measuring their occupancy 
by an analyte, by using very high-specific-activity "de-
veloping" antibody probes, thereby maximizing the sig-
nal/noise ratio in the determination of sensor antibody
occupancy. This observation, which contradicts cur- 
rently accepted immunoassay design theory, in turn 
makes possible the measurement of an unlimited num- 
ber of different analytes on a chip of very small surface 
area through the use of, e.g., laser scanning techniques 
closely analogous to those used in compact disk tech- 
niques of sound recording. Extensive experimental stud- 
ies in this area, albeit conducted with relatively crude 
techniques and instrumentation not specifically de- 



signed for these purposes, and therefore not reported in 
detail here, have demonstrated the feasibility of the 
miniaturized antibody microspot approach and the va- 
lidity of the general concepts on which it is based. We 
are therefore confident that this represents the basis of 
a next-generation technology that is likely to have a 
revolutionary impact on all fields involving the use of 
binding assays. 

References 

1. Yalow RS, Berson SA. General principles of radioimmunoassay.
In: Hayes RL, Goswitz FA, Murphy BEP, eds. Radioisotopes in
medicine: in vitro studies. Oak Ridge, TN: US Atomic Energy
Commission, 1968:7-39.

2. Ekins RP, Newman B, O'Riordan JLH. Ibid.:59-100.

3. Berson SA, Yalow RS. Measurement of hormones: radioimmu-
noassay. In: Berson SA, Yalow RS, eds. Methods in investigative
and diagnostic endocrinology, Vol 2A. Amsterdam: North Hol-
land/Elsevier, 1973:84-136.

4. Ekins R, Newman B. Theoretical aspects of saturation analysis.
In: Diczfalusy E, Diczfalusy A, eds. Steroid assay by protein
binding. Karolinska symposia on research methods in reproduc-
tive endocrinology. Stockholm: WHO/Karolinska Sjukhuset,
1970:11-30.

5. Ekins RP. Limitations of specific activity. In: Margoulies M, ed.
Protein and polypeptide hormones. Part 3 (Discussions). Amster-
dam: Excerpta Medica, 1968:612-[?], et seq.; Ekins RP. Concentra-
tions of tracer and antiserum, time and temperature of incubation,
volume of incubation. Ibid.:672-[?].

6. Yalow RS, Berson SA. Immunoassay of endogenous plasma
insulin in man. J Clin Invest 1960;39:1157.

7. Ekins RP. The estimation of thyroxine in human plasma by an
electrophoretic technique. Clin Chim Acta 1960;5:453-9.

8. Barakat RM, Ekins RP. Assay of vitamin B12 in blood: a
simple method. Lancet 1961;ii:25-6.

9. Wide L, Bennich H, Johansson SGO. Diagnosis of allergy by an
in-vitro test for allergen antibodies. Lancet 1967;ii:1105-7.

10. Miles LEH, Hales CN. Labeled antibodies and immunological
assay systems. Nature (London) 1968;219:186-9.

11. Keston AS, Udenfriend S, Cannan RK. Micro-analysis of [...]

12. [...] Simpson SA, Tait JF, Whitehead JK. The use of ³H
and ¹⁴C labeled acetic anhydride as analytical reagents in micro-
biochemistry. In: Johnston JE, Faires RA, Millett RJ, eds. Radio-
isotope conference. London: Butterworths, 1954:313-23.

13. Miles LEH, Hales CN. An immunoradiometric assay of insu-
lin. Op. cit. (ref. 5), Part 1:61-70.

14. Rodbard D, Weiss GH. Mathematical theory of immunometric
(labeled antibody) assay. Anal Biochem 1973;52:10-44.

15. Jackson TM, Marshall NJ, Ekins RP. Optimisation of immu-
noradiometric assays. In: Hunter WM, Corrie JET, eds. Immu-
noassays for clinical chemistry. Edinburgh: Churchill Livingstone.

16. Ekins RP. Measurement of analyte concentration. British
patent no. 8 224 600, 1983.

17. Wide L. Solid-phase antigen-antibody systems. In: Hunter
[...], eds. Radioimmunoassay methods. Edinburgh:
Churchill Livingstone, 1971:405-12.

18. Köhler G, Milstein C. Continuous culture of fused cells secret-
ing antibody of predefined specificity. Nature (London)

19. Marshall NJ, Dakubu S, Jackson T, Ekins RP. Pulsed light,
time resolved fluoroimmunoassay. In: Albertini A, Ekins RP, eds.
Monoclonal antibodies and developments in immunoassay.
Amsterdam: Elsevier/North Holland, 1981:101-8.

20. Soini E, Lövgren T. Time-resolved fluorescence of lanthanide
probes and applications in biotechnology [Review]. Crit Rev Anal
Chem 1987;18:105-54.

21. Ekins RP, Chu F, Biggart E. The development of microspot
[...] labeled antibodies. Anal Chim Acta 1990;227:73-96.

22. White JG, Amos WB, Fordham M. An evaluation of confocal
versus conventional imaging of biological structures by fluores-
cence light microscopy. J Cell Biol 1987;105:41-8.

23. Ploem JS. New instrumentation for sensitive image analysis
of fluorescence in cells and tissues. In: Taylor DL, Waggoner AS,
Lanni F, Murphy R, Birge R, eds. Applications of fluorescence in
the biological sciences. New York: Alan R Liss, 1986:289-300.

24. Fodor SPA, Read JL, Pirrung MC, et al. Light-directed,
spatially addressable parallel chemical synthesis. Science
1991;251:767-73.

25. Ekins RP. Immunoassay standardization. In: Kallner A, Magid
E, Albert W, eds. Improvement of comparability and compatibility
of laboratory assay results in life sciences. Immunoassay standard-
ization. Scand J Clin Lab Invest 1991;51(Suppl 205):33-46.

26. Köttgen E, Hell B, Müller C, Tauber R. Demonstration of
glycosylation variants of human fibrinogen, using the new tech-
nique of glycoprotein lectin immunosorbent assay (GLIA). Biol
Chem Hoppe Seyler 1988;369:1157-66.

27. Kinoshita N, Suzuki S, Matsuda Y, Taniguchi N. α-Fetopro-
tein antibody-lectin enzyme immunoassay to characterize sugar
chains for the study of liver diseases. Clin Chim Acta
1989;179:143-62.

28. Shalev V, Greenberg GH, McAlpine PJ. Detection of at-
tograms of antigen by a high sensitivity enzyme-linked immuno-
sorbent assay (HS-ELISA) using a fluorogenic substrate. J Immunol
Methods 1980;38:125.

29. Harris CC, Yolken RH, Kroken H, Hsu IC. Ultrasensitive
enzymatic radioimmunoassay: application to detection of cholera
toxin and rotavirus. Proc Natl Acad Sci USA 1979;76:5336.



Corrections 



Vol 37, pp. 1447-8: In our desire for rapid publication, 
important errors were introduced into the following 
Technical Brief. The corrected version is here repro- 
duced in its entirety, with our apologies to the authors. 

Rapid Detection of 1717-1G→A Mutation in CFTR Gene
by PCR-Mediated Site-Directed Mutagenesis, Laura
Cremonesi, Manuela Seia, Carmelina Magnani, and
Maurizio Ferrari (Istituto Scientifico H.S. Raffaele,
Lab. Centrale, Milano; Istituti Clin. di Perfezionamento,
Lab. di Ricerche Clin., Milano, Italy)

Until now, among the non-ΔF508 mutations identified in
the cystic fibrosis transmembrane conductance regulator
(CFTR) gene by the Cystic Fibrosis (CF) Genetic Analysis
Consortium, the ones most frequently seen in our population
sample are the 1717-1G→A mutation (13/144 or 9% of
the CF chromosomes) and the G542X mutation (16/190 or
8.4% of the CF chromosomes), both revealed by dot-blot
hybridization of the polymerase chain reaction (PCR) product
with allele-specific oligonucleotide (ASO) probes (1).

In an attempt to simplify the analysis of the most
frequent mutations in the CFTR gene, we converted
radiolabeled ASO detection into restriction endonuclease
analysis of the amplified product.

A PCR-mediated site-directed mutagenesis (2, 3) to detect
the G542X mutation by generating a novel BstNI site
in the wild-type sequence had already been suggested (4).

To detect the 1717-1G→A mutation, we designed the
reverse primer (5'-CTCTGCAAACTTGGAGAGGTC-3') to
contain a single-base mismatch (T→G), which could create
a novel AvaII restriction site [G↓G(A/T)CC] in the
amplified wild-type (WT) allele but not in the CF mutant (M)
allele:



WT:        1717-1
             |
     5'-...TAGGACA......GCAGAG...-3'
     3'-...ATCCTGG......CGTCTC...-5'
                 *
           AvaII site

M:         1717-1
             |
     5'-...TAAGACA......GCAGAG...-3'
     3'-...ATTCTGG......CGTCTC...-5'
                 *

* mutagenized base of reverse primer




Fig. 1. Detection of the 1717-1G→A mutation by PCR

Reactions were carried out with 1 µg of genomic DNA in a total volume of 100
µL containing 10 mmol/L Tris·HCl (pH 8.3), 50 mmol/L KCl, 1.5 mmol/L
MgCl2, 0.1 g/L gelatin, 200 µmol/L each of the four deoxyribonucleotide
triphosphates, 2.6 units of Taq polymerase (Perkin-Elmer Cetus, Norwalk,
CT), and 100 pmol of each of the primers. PCR conditions were as follows:
denaturation at 94 °C for 1 min, annealing at 55 °C for 30 s, and extension at
72 °C for 1 min, for a total of 30 cycles. PCR products were digested for 2 h at
37 °C with 5 U of AvaII and electrophoresed on 3% agarose-1% NuSieve gel
for 1 h at 50 V. Bands were made visible by staining the gel with ethidium
bromide. Lane 1: HaeIII-digested pBR322 size marker. Lane 2: normal
homozygote. Lane 3: CF patient homozygous for the 1717-1G→A mutation.
Lane 4: heterozygote carrier for the 1717-1G→A mutation.



For the forward primer, we used the one made available
by the CF Genetic Analysis Consortium to amplify exon 11
of the CFTR gene: 5'-CAACTGTGGTTAAAGCAATAGTGT-3'.

Digestion of the PCR product by AvaII generates
two fragments of 116 and 21 bp in the wild-type alleles
and leaves undigested a 137-bp fragment in the mutant
alleles (Figure 1).
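
The digestion logic described above can be sketched in Python. This is an illustrative sketch only, not the authors' procedure: the function names are hypothetical, the seven-base test strings (TAGGACC for wild-type, TAAGACC for mutant) are assumed post-amplification contexts inferred from the primer design, and the genotype call simply maps the band sizes stated in the text (137, 116, and 21 bp).

```python
import re

# AvaII recognizes G^G(A/T)CC. The site is its own reverse complement,
# so scanning one strand is sufficient.
AVAII_SITE = re.compile(r"GG[AT]CC")


def has_avaii_site(seq: str) -> bool:
    """Return True if the sequence contains at least one AvaII site."""
    return AVAII_SITE.search(seq.upper()) is not None


def call_genotype(band_sizes_bp) -> str:
    """Map observed band sizes (bp) to a genotype using the sizes in the text."""
    bands = set(band_sizes_bp)
    cut = {116, 21} <= bands   # digested wild-type allele
    uncut = 137 in bands       # undigested mutant allele
    if cut and uncut:
        return "heterozygote carrier"
    if cut:
        return "normal homozygote"
    if uncut:
        return "homozygous 1717-1G>A"
    return "uninterpretable"
```

In practice the 21-bp fragment may be hard to see on a gel, so a real caller would probably key on the 137- and 116-bp bands; that refinement is left out of this sketch.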

By combined analysis for the ΔF508 mutation (5) (252/470
or 53.6% of the CF chromosomes), 1717-1G→A, and
G542X, about 71% of mutations might be detected by
nonisotopic analysis of the PCR product, thus allowing a
faster and easier one-day procedure for carrier screening
and prenatal testing.
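
The "about 71%" figure can be checked by adding the three allele frequencies quoted in this brief. The short sketch below assumes that the three mutations never occur on the same chromosome and that frequencies measured on different sample sizes can simply be summed; both are simplifying assumptions made here, not statements from the authors.

```python
from fractions import Fraction

# (mutant chromosomes found, CF chromosomes tested), as quoted in the text
counts = {
    "deltaF508": (252, 470),
    "1717-1G>A": (13, 144),
    "G542X": (16, 190),
}

# Per-mutation frequency among CF chromosomes
freqs = {name: Fraction(found, tested) for name, (found, tested) in counts.items()}

# Combined detection rate, assuming the mutations are mutually exclusive
combined = float(sum(freqs.values()))

for name, freq in freqs.items():
    print(f"{name}: {float(freq) * 100:.1f}%")
print(f"combined: {combined * 100:.1f}%")
```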



References 

1. Kerem B, Zielenski J, Markiewicz D, et al. Identification of
mutations in regions corresponding to the two putative nucleotide
(ATP)-binding folds of the cystic fibrosis gene. Proc Natl Acad Sci
USA 1990;87:8447-51.

2. Haliassos A, Chomel JC, Baudis M, Kruh J, Kaplan JC, Kitzis
A. Modification of enzymatically amplified DNA for the detection
of point mutations. Nucleic Acids Res 1989;17:3606.

3. Friedman KJ, Highsmith WE Jr, Prior TW, Perry TR, Silverman
LM. Cystic fibrosis deletion mutation detected by PCR-mediated
site-directed mutagenesis [Tech Brief]. Clin Chem 1990;36:695-6.

4. Ng ISL, Pace R, Richard MV, et al. Methods for analysis of
multiple cystic fibrosis mutations. Hum Genet (in press).

5. Ferrari M, Cremonesi L. More on detection of cystic fibrosis by
polymerase chain reaction [Response to Letter]. Clin Chem
1990;36:1702-3.




Proc. Natl. Acad. Sci. USA

Vol. 95, pp. 6073-6078, May 1998 

Biochemistry 



Reference 1 of 4 

in Response dated 04/16/04 

In USSN: 09/828,423 



Assessing sequence comparison methods with reliable structurally 
identified distant evolutionary relationships 

Steven E. Brenner*†‡, Cyrus Chothia*, and Tim J. P. Hubbard§

*MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, United Kingdom; and §Sanger Centre, Wellcome Trust Genome Campus, Hinxton,
Cambs CB10 1SA, United Kingdom

Communicated by David R. Davies, National Institute of Diabetes, Bethesda, MD, March 16, 1998 (received for review November 12, 1997)



ABSTRACT Pairwise sequence comparison methods have 
been assessed using proteins whose relationships are known 
reliably from their structures and functions, as described in 
the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T.
& Chothia C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation
tested the programs BLAST [Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol.
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996)
Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. &
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448],
and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol.
Biol. 147, 195-197] and their scoring schemes. The error rate
of all algorithms is greatly reduced by using statistical scores
to evaluate matches rather than percentage identity or raw
scores. The E-value statistical scores of SSEARCH and FASTA are
reliable: the number of false positives found in our tests agrees
well with the scores reported. However, the P-values reported
by BLAST and WU-BLAST2 exaggerate significance by orders of
magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform
best, and they are capable of detecting almost all relationships 
between proteins whose sequence identities are >30%. For 
more distantly related proteins, they do much less well; only 
one-half of the relationships between proteins with 20-30% 
identity are found. Because many homologs have low sequence 
similarity, most distant relationships cannot be detected by 
any pairwise comparison method; however, those which are 
identified may be used with confidence. 



Sequence database searching plays a role in virtually every 
branch of molecular biology and is crucial for interpreting the 
sequences issuing forth from genome projects. Given the 
method's central role, it is surprising that overall and relative 
capabilities of different procedures are largely unknown. It is 
difficult to verify algorithms on sample data because this 
requires large data sets of proteins whose evolutionary rela- 
tionships are known unambiguously and independently of the 
methods being evaluated. However, nearly all known ho- 
mologs have been identified by sequence analysis (the method 
to be tested). Also, it is generally very difficult to know, in the 
absence of structural data, whether two proteins that lack clear 
sequence similarity are unrelated. This has meant that al- 
though previous evaluations have helped improve sequence 
comparison, they have suffered from insufficient, imperfectly 
characterized, or artificial test data. Assessment also has been 
problematic because high quality database sequence searching 
attempts to have both sensitivity (detection of homologs) and 
specificity (rejection of unrelated proteins); however, these 
complementary goals are linked such that increasing one 
causes the other to be reduced. 



The publication costs of this article were defrayed in part by page charge 
payment. This article must therefore be hereby marked "advertisement" in 
accordance with 18 U.S.C. §1734 solely to indicate this fact. 

© 1998 by The National Academy of Sciences 0027-8424/98/956073-6$2.00/0
PNAS is available online at http://www.pnas.org. 



Sequence comparison methodologies have evolved rapidly,
so no previously published test has evaluated modern versions
of programs commonly used. For example, parameters in
BLAST (1) have changed, and WU-BLAST2 (2), which produces
gapped alignments, has become available. The latest version
of FASTA (3) previously tested was 1.6, but the current release
(version 3.0) provides fundamentally different results in the
form of statistical scoring.

The previous reports also have left gaps in our knowledge. 
For example, there has been no published assessment of 
thresholds for scoring schemes more sophisticated than per- 
centage identity. Thus, the widely discussed statistical scoring 
measures have never actually been evaluated on large data- 
bases of real proteins. Moreover, the different scoring schemes 
commonly in use have not been compared. 

Beyond these issues, there is a more fundamental question: 
in an absolute sense, how well does pairwise sequence com- 
parison work? That is, what fraction of homologous proteins 
can be detected using modern database searching methods? 

In this work, we attempt to answer these questions and to 
overcome both of the fundamental difficulties that have hin- 
dered assessment of sequence comparison methodologies. 
First, we use the set of distant evolutionary relationships in the 
SCOP: Structural Classification of Proteins database (4), which 
is derived from structural and functional characteristics (5). 
The SCOP database provides a uniquely reliable set of ho- 
mologs, which are known independently of sequence compar- 
ison. Second, we use an assessment method that jointly mea- 
sures both sensitivity and specificity. This method allows 
straightforward comparison of different sequence searching 
procedures. Further, it can be used to aid interpretation of real 
database searches and thus provide optimal and reliable 
results. 

Previous Assessments of Sequence Comparison. Several 
previous studies have examined the relative performance of 
different sequence comparison methods. The most encom- 
passing analyses have been by Pearson (6, 7), who compared 
the three most commonly used programs. Of these, the Smith- 
Waterman algorithm (8) implemented in SSEARCH (3) is the 
oldest and slowest but the most rigorous. Modern heuristics 
have provided blast (1) the speed and convenience to make 
it the most popular program. Intermediate between these two 
is FASTA (3), which may be run in two modes offering either 
greater speed (ktup = 2) or greater effectiveness (ktup = 1). 
Pearson also considered different parameters for each of these 
programs. 

To test the methods, Pearson selected two representative 
proteins from each of 67 protein superfamilies defined by the 
PIR database (9). Each was used as a query to search the 
database, and the matched proteins were marked as being 
homologous or unrelated according to their membership of PIR 



Abbreviation: EPQ, errors per query. 

†Present address: Department of Structural Biology, Stanford University,
Fairchild Building D-109, Stanford, CA 94305-5126.

‡To whom reprint requests should be addressed. e-mail:
brenner@hyper.stanford.edu.






superfamilies. Pearson found that modern matrices and
"ln-scaling" of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked
slightly better than FASTA, which was in turn more effective
than BLAST.

Very large scale analyses of matrices have been performed 
(10), and Henikoff and Henikoff (11) also evaluated the 
effectiveness of BLAST and FASTA. Their test with BLAST 
considered the ability to detect homologs above a predeter- 
mined score but had no penalty for methods which also 
reported large numbers of spurious matches. The Henikoffs 
searched the SWISS-PROT database (12) and used PROSITE (13)
to define homologous families. Their results showed that the 
BLOSUM62 matrix (14) performed markedly better than the 
extrapolated PAM-series matrices (15), which previously had 
been popular. 

A crucial aspect of any assessment is the data that are used 
to test the ability of the program to find homologs. But in 
Pearson's and the Henikoffs' evaluations of sequence com- 
parison, the correct results were effectively unknown. This is 
because the superfamilies in PIR and PROSITE are principally 
created by using the same sequence comparison methods 
which are being evaluated. Interdependency of data and 
methods creates a "chicken and egg" problem, and means for 
example, that new methods would be penalized for correctly 
identifying homologs missed by older programs. For instance, 
immunoglobulin variable and constant domains are clearly 
homologous, but PIR places them in different superfamilies. 
The problem is widespread: each superfamily in PIR 48.00 with 
a structural homolog is itself homologous to an average of 1.6 
other PIR superfamilies (16). 

To surmount these sorts of difficulties, Sander and Schnei- 
der (17) used protein structures to evaluate sequence com- 
parison. Rather than comparing different sequence compari- 
son algorithms, their work focused on determining a length- 
dependent threshold of percentage identity, above which all 
proteins would be of similar structure. A result of this analysis 
was the HSSP equation; it states that proteins with 25% identity
over 80 residues will have similar structures, whereas shorter 
alignments require higher identity. (Other studies also have 
used structures (18-20), but these focused on a small number 
of model proteins and were principally oriented toward eval- 
uating alignment accuracy rather than homology detection.) 

A general solution to the problem of scoring comes from 
statistical measures (i.e., E-values and P-values) based on the 
extreme value distribution (21). Extreme value scoring was 
implemented analytically in the BLAST program using the 
Karlin and Altschul statistics (22, 23) and empirical approaches
have been recently added to FASTA and SSEARCH. In
addition to being heralded as a reliable means of recognizing 
significantly similar proteins (24, 25), the mathematical trac- 
tability of statistical scores "is a crucial feature of the BLAST 
algorithm" (1). The validity of this scoring procedure has been 
tested analytically and empirically (see ref. 2 and references in 
ref. 24). However, all large empirical tests used random 
sequences that may lack the subtle structure found within 
biological sequences (26, 27) and obviously do not contain any 
real homologs. Thus, although many researchers have sug- 
gested that statistical scores be used to rank matches (24, 25, 
28), there have been no large rigorous experiments on biolog- 
ical data to determine the degree to which such rankings are 
superior. 

A Database for Testing Homology Detection. Since the 
discovery that the structures of hemoglobin and myoglobin are 
very similar though their sequences are not (29), it has been 
apparent that comparing structures is a more powerful (if less 
convenient) way to recognize distant evolutionary relation- 
ships than comparing sequences. If two proteins show a high 
degree of similarity in their structural details and function, it 



is very probable that they have an evolutionary relationship 
though their sequence similarity may be low. 

The recent growth of protein structure information com- 
bined with the comprehensive evolutionary classification in 
the SCOP database (4, 5) have allowed us to overcome previous 
limitations. With these data, we can evaluate the performance 
of sequence comparison methods on real protein sequences 
whose relationships are known confidently. The SCOP database 
uses structural information to recognize distant homologs, the 
large majority of which can be determined unambiguously. 
These superfamilies, such as the globins or the immunoglobu- 
lins, would be recognized as related by the vast majority of the 
biological community despite the lack of high sequence sim- 
ilarity. 

From SCOP, we extracted the sequences of domains of
proteins in the Protein Data Bank (PDB) (30) and created two
databases. One (PDB90D-B) contains domains that are all <90%
identical to one another, whereas the other (PDB40D-B) contains
domains that are <40% identical. The databases were created by first sorting all
protein domains in SCOP by their quality and making a list. The
highest quality domain was selected for inclusion in the 
database and removed from the list. Also removed from the list 
(and discarded) were all other domains above the threshold 
level of identity to the selected domain. This process was 
repeated until the list was empty. The PDB40D-B database
contains 1,323 domains, which have 9,044 ordered pairs of
distant relationships, or ~0.5% of the total 1,749,006 ordered
pairs. In PDB90D-B, the 2,079 domains have 53,988 relationships,
representing 1.2% of all pairs. Low complexity regions
of sequence can achieve spurious high scores, so these were
masked in both databases by processing with the SEG program
(27) using recommended parameters: 12 1.8 2.0. The databases
used in this paper are available from http://sss.stanford.edu/sss/,
and databases derived from the current version of SCOP
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/.
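
The selection procedure described above is a greedy filter, and a minimal Python sketch may make it concrete. The function names and the `quality` and `identity` callables are hypothetical placeholders; the paper does not specify how domain quality or pairwise identity were computed, so any real use would have to supply those. The second helper reproduces the ordered-pair count quoted in the text.

```python
def select_nonredundant(domains, quality, identity, threshold):
    """Greedy selection of a nonredundant domain set.

    domains:   iterable of domain identifiers
    quality:   callable, domain -> number (higher is better)
    identity:  callable, (domain, domain) -> percent identity
    threshold: domains at or above this identity to a selected
               domain are discarded
    """
    # Sort the list once, best quality first.
    remaining = sorted(domains, key=quality, reverse=True)
    selected = []
    while remaining:
        best = remaining.pop(0)          # highest quality domain left
        selected.append(best)
        # Discard everything too similar to the domain just selected.
        remaining = [d for d in remaining if identity(best, d) < threshold]
    return selected


def ordered_pairs(n_domains: int) -> int:
    """Number of ordered (query, target) pairs, excluding self-comparisons."""
    return n_domains * (n_domains - 1)
```

For PDB40D-B, `ordered_pairs(1323)` gives the 1,749,006 ordered pairs quoted above, and 9,044 of those is roughly 0.5%.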

Analyses from both databases were generally consistent, but 
PDB40D-B focuses on distantly related proteins and reduces the 
heavy overrepresentation in the PDB of a small number of 
families (31, 32), whereas PDB90D-B (with more sequences) 
improves evaluations of statistics. Except where noted other- 
wise, the distant homolog results here are from PDB40D-B. 
Although the precise numbers reported here are specific to the 
structural domain databases used, we expect the trends to be 
general. 

Assessment Data and Procedure. Our assessment of se- 
quence comparison may be divided into four different major 
categories of tests. First, using just a single sequence compar- 
ison algorithm at a time, we evaluated the effectiveness of 
different scoring schemes. Second, we assessed the reliability 
of scoring procedures, including an evaluation of the validity 
of statistical scoring. Third, we compared sequence compari- 
son algorithms (using the optimal scoring scheme) to deter- 
mine their relative performance. Fourth, we examined the 
distribution of homologs and considered the power of pairwise 
sequence comparison to recognize them. All of the analyses 
used the databases of structurally identified homologs and a 
new assessment criterion. 

The analyses tested BLAST (1), version 1.4.9MP, and
WU-BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix
(BLOSUM62) were used for BLAST and WU-BLAST2.

The "Coverage Vs. Error" Plot. To test a particular protocol 
(comprising a program and scoring scheme), each sequence 
from the database was used as a query to search the database. 
This yielded ordered pairs of query and target sequences with 
associated scores, which were sorted, on the basis of their 
scores, from best to worst. The ideal method would have 






[Fig. 1, panels A and B: "Smith-Waterman Scoring Schemes (PDB40D-B)" and "Smith-Waterman Scoring Schemes (PDB90D-B)"; x axis: Coverage; y axis: Errors per Query (log scale).]

Fig. 1. Coverage vs. error plots of different scoring schemes for SSEARCH Smith-Waterman. (A) Analysis of PDB40D-B database. (B) Analysis
of PDB90D-B database. All of the proteins in the database were compared with each other using the SSEARCH program. The results of this single
set of comparisons were considered using five different scoring schemes and assessed. The graphs show the coverage and errors per query (EPQ)
for statistical scores, raw scores, and three measures using percentage identity. In the coverage vs. error plot, the x axis indicates the fraction of
all homologs in the database (known from structure) which have been detected. Precisely, it is the number of detected pairs of proteins with the
same fold divided by the total number of pairs from a common superfamily. PDB40D-B contains a total of 9,044 homologs, so a score of 10% indicates
identification of 904 relationships. The y axis reports the number of EPQ. Because there are 1,323 queries made in the PDB40D-B all-vs.-all
comparison, 13 errors correspond to 0.01, or 1% EPQ. The y axis is presented on a log scale to show results over the widely varying degrees of
accuracy which may be desired. The scores that correspond to the levels of EPQ and coverage are shown in Fig. 4 and Table 1. The graph
demonstrates the trade-off between sensitivity and selectivity. As more homologs are found (moving to the right), more errors are made (moving
up). The ideal method would be in the lower right corner of the graph, which corresponds to identifying many evolutionary relationships without
selecting unrelated proteins. Three measures of percentage identity are plotted. Percentage identity within alignment is the degree of identity within
the aligned region of the proteins, without consideration of the alignment length. Percentage identity within both is the number of identical residues
in the aligned region as a percentage of the average length of the query and target proteins. The HSSP equation (17) is $H = 290.15\,l^{-0.562}$, where
$l$ is length, for $10 < l < 80$; $H > 100$ for $l < 10$; $H = 24.7$ for $l > 80$. The percentage identity HSSP-adjusted score is the percent identity within
the alignment minus H. Smith-Waterman raw scores and E-values were taken directly from the sequence comparison program.
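
The HSSP threshold and the HSSP-adjusted score defined in the caption above can be written as a short Python sketch. Two choices here are assumptions rather than statements from the source: the caption gives strict inequalities, so the boundary lengths 10 and 80 are assigned to the power-law branch, and for lengths below 10 (where the caption says only that H exceeds 100) the function returns infinity to signal that no percentage identity can pass. The function names are hypothetical.

```python
import math


def hssp_threshold(length: int) -> float:
    """Length-dependent percent-identity threshold H from the HSSP equation."""
    if length < 10:
        # Source states only H > 100; infinity is an assumption meaning
        # "no percent identity is sufficient at this length".
        return math.inf
    if length <= 80:
        return 290.15 * length ** -0.562
    return 24.7


def hssp_adjusted_score(percent_identity: float, length: int) -> float:
    """Percent identity within the alignment minus H."""
    return percent_identity - hssp_threshold(length)
```

At a length of 80 the power law evaluates to about 24.7, so the two branches meet smoothly. For an alignment like the one in Fig. 2 (39% identity over 64 residues), the adjusted score is positive, which is consistent with the text's point that the HSSP threshold still admits some unrelated pairs.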



perfect separation, with all of the homologs at the top of the 
list and unrelated proteins below. In practice, perfect separa- 
tion is impossible to achieve so instead one is interested in 
drawing a threshold above which there are the largest number 
of related pairs of sequences consistent with an acceptable 
error rate. 

Our procedure involved measuring the coverage and error 
for every threshold. Coverage was defined as the fraction of 
structurally determined homologs that have scores above the 
selected threshold; this reflects the sensitivity of a method. 
Errors per query (EPQ), an indicator of selectivity, is the 
number of nonhomologous pairs above the threshold divided 
by the number of queries. Graphs of these data, called 
coverage vs. error plots, were devised to understand how 




Hemoglobin β-chain (1hdsb)    Cellulase E2 (1tml_)



1hdsb GKVDVDWGAQALGR--LLWYPWTQRFFQHFGNLSSAGAVMNNPKVKAHGKRVLDAFTQGLKH
1tml_ GQVDALHSAAQAAGKIPILWYNAPGR---DCGNHSSGGAPSHSAY-RSWIDEFAAGLKN

Fig. 2. Unrelated proteins with high percentage identity. Hemoglobin
β-chain (PDB code 1hds chain b, ref. 38, Left) and cellulase E2
(PDB code 1tml, ref. 39, Right) have 39% identity over 64 residues, a
level which is often believed to be indicative of homology. Despite this
high degree of identity, their structures strongly suggest that these
proteins are not related. Appropriately, neither the raw alignment
score of 85 nor the E-value of 1.3 is significant. Proteins rendered by
RASMOL (40).



protocols compare at different levels of accuracy. These
graphs share effectively all of the beneficial features of
Receiver Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in
sequence comparison and the huge background of
nonhomologs.
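
The coverage and EPQ definitions above translate directly into a threshold sweep over the sorted list of scored pairs. The following Python sketch is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, higher scores are assumed to be better (E-values would need to be negated first), and tied scores are not grouped, so each pair is treated as its own threshold.

```python
def coverage_vs_error(scored_pairs, n_homolog_pairs, n_queries):
    """Compute one (coverage, errors_per_query) point per threshold.

    scored_pairs:    iterable of (score, is_homolog) for each
                     query-target pair, higher score = better match
    n_homolog_pairs: total structurally determined homolog pairs
    n_queries:       number of query sequences
    """
    points = []
    true_hits = 0
    false_hits = 0
    # Walk down the ranked list; each step lowers the threshold by one pair.
    for _score, is_homolog in sorted(scored_pairs, key=lambda p: p[0], reverse=True):
        if is_homolog:
            true_hits += 1
        else:
            false_hits += 1
        points.append((true_hits / n_homolog_pairs, false_hits / n_queries))
    return points


def coverage_at_epq(points, max_epq):
    """Best coverage achievable without exceeding max_epq errors per query."""
    eligible = [cov for cov, epq in points if epq <= max_epq]
    return max(eligible) if eligible else 0.0
```

With the numbers given for PDB40D-B, 13 nonhomologous pairs above the threshold across 1,323 queries gives 13/1323, or about 0.01 EPQ.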

This assessment procedure is directly relevant to practical 
sequence database searching, for it provides precisely the 
information necessary to perform a reliable sequence database 
search. The EPQ measure places a premium on score consis- 
tency; that is, it requires scores to be comparable for different 
queries. Consistency is an aspect which has been largely 

[Fig. 3: "Percent Identity of Unrelated Proteins (PDB90D-B)"; x axis: Alignment length (0 to 200).]

Fig. 3. Length and percentage identity of alignments of unrelated
proteins in PDB90D-B: Each pair of nonhomologous proteins found with
SSEARCH is plotted as a point whose position indicates the length and
the percentage identity within the alignment. Because alignment
length and percentage identity are quantized, many pairs of proteins
may have exactly the same alignment length and percentage identity.
The line shows the HSSP threshold (though it is intended to be applied
with a different matrix and parameters).






[Fig. 4: "Reliability of Statistical Scores (PDB90D-B)"; x axis: Errors Per Query (0.001 to 10, log scale).]



Fig. 4. Reliability of statistical scores in PDB90D-B: Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for SSEARCH and
FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers and diverge
at higher values, as indicated by the lower bold line.) E-values from
SSEARCH and FASTA are shown to have good agreement with EPQ but
underestimate the significance slightly. BLAST and WU-BLAST2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB40D-B were similar to those for PDB90D-B
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given statistical
score.

ignored in previous tests but is essential for the straightforward 
or automatic interpretation of sequence comparison results. 
Further, it provides a clear indication of the confidence that 
should be ascribed to each match. Indeed, the EPQ measure 
should approximate the expectation value reported by data- 
base searching programs, if the programs' estimates are accu- 
rate. 

The Performance of Scoring Schemes. All of the programs
tested could provide three fundamental types of scores. The
first score is the percentage identity, which may be computed
in several ways based on either the length of the alignment or
the lengths of the sequences. The second is a "raw" or
"Smith-Waterman" score, which is the measure optimized by
the Smith-Waterman algorithm and is computed by summing
the substitution matrix scores for each position in the alignment
and subtracting gap penalties. In BLAST, a measure
related to this score is scaled into bits. Third is a statistical
score based on the extreme value distribution. These results
are summarized in Fig. 1.
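
As a rough illustration of the raw score defined above, the following Python sketch sums substitution scores over aligned positions and subtracts gap penalties. It scores an alignment that is already given; it does not perform the Smith-Waterman optimization. The function name is hypothetical, and the gap convention (the first gapped position costs `gap_open`, each further consecutive gapped position costs `gap_extend`) is an assumption, since programs differ on how open and extension penalties combine. The substitution matrix is passed in as a dictionary keyed by residue pairs.

```python
def raw_score(aligned_query, aligned_target, matrix, gap_open, gap_extend):
    """Score a given gapped alignment.

    aligned_query, aligned_target: equal-length strings, '-' marks a gap
    matrix:     dict mapping (residue, residue) -> substitution score
    gap_open:   positive penalty for the first position of a gap (assumed)
    gap_extend: positive penalty for each further gap position (assumed)
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(aligned_query, aligned_target):
        if a == "-" or b == "-":
            # Subtract the open penalty once, then the extension penalty.
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += matrix[(a, b)]
            in_gap = False
    return score
```

A real evaluation would pass a published matrix such as BLOSUM45 with penalties of 12 and 1, as used for SSEARCH and FASTA in this study; a toy match/mismatch matrix is enough to exercise the function.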

Sequence Identity. Though it has been long established that 
percentage identity is a poor measure (35), there is a common 
rule-of-thumb stating that 30% identity signifies homology. 
Moreover, publications have indicated that 25% identity can 
be used as a threshold (17, 36). We find that these thresholds, 
originally derived years ago, are not supported by present 
results. As databases have grown, so have the possibilities for 
chance alignments with high identity; thus, the reported cutoffs 
lead to frequent errors. Fig. 2 shows one of the many pairs of 
proteins with very different structures that nonetheless have 
high levels of identity over considerable aligned regions. 
Despite the high identity, the raw and the statistical scores for 
such incorrect matches are typically not significant. The prin- 
cipal reasons percentage identity does so poorly seem to be 
that it ignores information about gaps and about the conser- 
vative or radical nature of residue substitutions. 

From the PDB90D-B analysis in Fig. 3, we learn that 30% 
identity is a reliable threshold for this database only for 
sequence alignments of at least 150 residues. Because one 
unrelated pair of proteins has 43.5% identity over 62 residues, 
it is probably necessary for alignments to be at least 70 residues 
in length before 40% is a reasonable threshold, for a database 
of this particular size and composition. 

At a given reliability, scores based on percentage identity 
detect just a fraction of the distant homologs found by 
statistical scoring. If one measures the percentage identity in 
the aligned regions without consideration of alignment length, 
then a negligible number of distant homologs are detected. 
Use of the hssp equation improves the value of percentage 
identity, but even this measure can find only 4% of all known 
homologs at 1% EPQ. In short, percentage identity discards 
most of the information measured in a sequence comparison. 

Raw Scores. Smith-Waterman raw scores perform better 
than percentage identity (Fig. 1), but ln-scaling (7) provided no 
notable benefit in our analysis. It is necessary to be very precise 
when using either raw or bit scores because a 20% change in 
cutoff score could yield a tenfold difference in EPQ. However, 
it is difficult to choose appropriate thresholds because the 
reliability of a bit score depends on the lengths of the proteins 
matched and the size of the database. Raw score thresholds 
also are affected by matrix and gap parameters. 

Statistical Scores. Statistical scores were introduced partly 
to overcome the problems that arise from raw scores. This 
scoring scheme provides the best discrimination between 
homologous proteins and those which are unrelated. Most 

[Figure 5 image. Panel title: Sequence Comparison Algorithms (PDB90D-B); axis label: Coverage.]

Fig. 5. Coverage vs. error plots of different sequence comparison methods. Five different sequence comparison methods are evaluated, each
using statistical scores (E- or P-values). (A) PDB40D-B database. In this analysis, the best method is the slow ssearch, which finds 18% of relationships
at 1% EPQ. fasta ktup = 1 and WU-BLAST2 are almost as good. (B) PDB90D-B database. The quick WU-BLAST2 program provides the best coverage
at 1% EPQ on this database, although at higher levels of error it becomes slightly worse than fasta ktup = 1 and ssearch.



Biochemistry: Brenner et al. 



Proc. Natl Acad. Sci. USA 95 (1998) 6077 



likely, its power can be attributed to its incorporation of more 
information than any other measure; it takes account of the 
full substitution and gap data (like raw scores) but also has 
details about the sequence lengths and composition and is 
scaled appropriately. 

We find that statistical scores are not only powerful, but also
easy to interpret. ssearch and fasta show close agreement
between statistical scores and actual number of errors per
query (Fig. 4). The expectation value score gives a good,
slightly conservative estimate of the chances of the two
sequences being found at random in a given query. Thus, an
E-value of 0.01 indicates that roughly one pair of nonhomologs
of this similarity should be found in every 100 different queries.
Neither raw scores nor percentage identity can be interpreted 
in this way, and these results validate the suitability of the 
extreme value distribution for describing the scores from a 
database search. 
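
The reading of an E-value as an expected count can be illustrated with a short sketch. It assumes the usual Poisson relation P = 1 - exp(-E) between an expectation value and the probability of at least one chance match; the function names are illustrative only and are not part of ssearch, fasta, or blast.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance match, assuming Poisson-distributed chance hits."""
    # expm1 keeps precision for the small E-values typical of significant matches
    return -math.expm1(-e_value)


def expected_false_matches(e_value: float, n_queries: int) -> float:
    """Expected number of nonhomologous matches at this E-value over n_queries searches."""
    return e_value * n_queries


# An E-value of 0.01 corresponds to about one chance match per 100 queries.
print(expected_false_matches(0.01, 100))
print(p_value_from_e_value(0.01))
```

For small values the two quantities are nearly equal, which is why E-values and P-values are often read interchangeably near the significance threshold.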

The P-values from BLAST also should be directly interpret- 
able but were found to overstate significance by more than two 
orders of magnitude for 1% EPQ for this database. Nonethe- 
less, these results strongly suggest that the analytic theory is 
fundamentally appropriate. WU-BLAST2 scores were more re- 
liable than those from BLAST, but also exaggerate expected 
confidence by more than an order of magnitude at 1% EPQ. 

Overall Detection of Homologs and Comparison of Algo- 
rithms. The results in Fig. 5A and Table 1 show that pairwise 
sequence comparison is capable of identifying only a small 
fraction of the homologous pairs of sequences in PDB40D-B. 
Even SSEARCH with E-values, the best protocol tested, could 
find only 18% of all relationships at a 1% EPQ. blast, which 
identifies 15%, was the worst performer, whereas FASTA 
ktup = 1 is nearly as effective as SSEARCH. FASTA ktup = 2 and 
WU-BLAST2 are intermediate in their ability to detect ho- 
mologs. Comparison of different algorithms indicates that 
those capable of identifying more homologs are generally 
slower. SSEARCH is 25 times slower than BLAST and 6.5 times 
slower than fasta ktup = 1. WU-BLAST2 is slightly faster than 
FASTA ktup = 2, but the latter has more interpretable scores. 
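
The two quantities compared above, coverage and errors per query (EPQ), can be sketched as follows. This is a minimal illustration, assuming each reported match is available as an (E-value, is_homolog) pair with homology judged independently of the score; the toy data and all names are hypothetical and do not reproduce the authors' evaluation.

```python
from typing import Iterable, Tuple


def coverage_and_epq(hits: Iterable[Tuple[float, bool]],
                     cutoff: float,
                     total_homolog_pairs: int,
                     n_queries: int) -> Tuple[float, float]:
    """Return (coverage, errors per query) for matches scoring at or below the E-value cutoff."""
    true_hits = 0
    false_hits = 0
    for e_value, is_homolog in hits:
        if e_value <= cutoff:
            if is_homolog:
                true_hits += 1
            else:
                false_hits += 1
    # Coverage: fraction of all known homologous pairs that were found
    # EPQ: accepted nonhomologous pairs divided by the number of queries
    return true_hits / total_homolog_pairs, false_hits / n_queries


# Made-up matches: (E-value, whether the pair is a structurally known homolog)
toy_hits = [(1e-30, True), (1e-5, True), (0.02, True),
            (0.025, False), (3.0, True), (8.0, False)]
coverage, epq = coverage_and_epq(toy_hits, cutoff=0.03,
                                 total_homolog_pairs=10, n_queries=100)
print(coverage, epq)
```

Sweeping the cutoff and plotting coverage against EPQ would give a curve of the kind shown in Fig. 5.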

In PDB90D-B, where there are many close relationships, the
best method can identify only 38% of structurally known
homologs (Fig. 5B). The method which finds that many
relationships is WU-BLAST2. Consequently, we infer that the
differences between FASTA ktup = 1, SSEARCH, and WU-BLAST2
programs are unlikely to be significant when compared with
variation in database composition and scoring reliability.

Fig. 6 helps to explain why most distant homologs cannot be 
found by sequence comparison: a great many such relation- 
ships have no more sequence identity than would be expected 
by chance. SSEARCH with E-values can recognize >90% of the 
homologous pairs with 30-40% identity. In this region, there 
are 30 pairs of homologous proteins that do not have signif- 
icant E-values, but 26 of these involve sequences with <50 
residues. Of sequences having 25-30% identity, 75% are 
identified by ssearch E-values. However, although the num- 
ber of homologs grows at lower levels of identity, the detection 
falls off sharply: only 40% of homologs with 20-25% identity 



[Figure 6 image. Title: Distribution and Detection of Homologs (PDB40D-B); horizontal axis: Percentage identity: in both.]



Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars 
show the distribution of homologous pairs in PDB40D-B according to their
identity (using the measure of identity in both). Filled regions indicate 
the number of these pairs found by the best database searching method 
(ssearch with E-values) at 1% EPQ. The PDB40D-B database contains 
proteins with <40% identity, and as shown on this graph, most 
structurally identified homologs in the database have diverged ex- 
tremely far in sequence and have <20% identity. Note that the 
alignments may be inaccurate, especially at low levels of identity. Filled 
regions show that ssearch can identify most relationships that have 
25% or more identity, but its detection wanes sharply below 25%. 
Consequently, the great sequence divergence of most structurally 
identified evolutionary relationships effectively defeats the ability of 
pairwise sequence comparison to detect them.

are detected and only 10% of those with 15-20% can be found. 
These results show that statistical scores can find related 
proteins whose identity is remarkably low; however, the power 
of the method is restricted by the great divergence of many 
protein sequences. 

After completion of this work, a new version of pairwise 
blast was released: blastgp (37). It supports gapped align- 
ments, like WU-BLAST2, and dispenses with sum statistics. Our 
initial tests on BLASTGP using default parameters show that its 
E-values are reliable and that its overall detection of homologs 
was substantially better than that of ungapped BLAST, but not 
quite equal to that of WU-BLAST2. 

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27
and references therein) suggests that the most effective
sequence searches are made by (i) using a large current database
in which the protein sequences have been complexity masked
and (ii) using statistical scores to interpret the results. Our
experiments fully support this view.

Our results also suggest two further points. First, the E-val- 
ues reported by FASTA and ssearch give fairly accurate 
estimates of the significance of each match, but the P-values 
provided by blast and WU-BLAST2 underestimate the true 



Table 1. Summary of sequence comparison methods with PDB40D-B

Method                                  Relative Time*   1% EPQ Cutoff      Coverage at 1% EPQ
ssearch % identity: within alignment    25.5             >70%               <0.1
ssearch % identity: within both         25.5             34%                3.0
ssearch % identity: Hssp-scaled         25.5             35% (hssp + 9.8)   4.0
ssearch Smith-Waterman raw scores       25.5             142                10.5
ssearch E-values                        25.5             0.03               18.4
fasta ktup = 1 E-values                 3.9              0.03               17.9
fasta ktup = 2 E-values                 1.4              0.03               16.7
WU-BLAST2 P-values                      1.1              0.003              17.5
blast P-values                          1.0              0.00016            14.8

*Times are from large database searches with genome proteins.




extent of errors. Second, ssearch, WU-BLAST2, and fasta 
ktup = 1 perform best, though blast and fasta ktup = 2 
detect most of the relationships found by the best procedures 
and are appropriate for rapid initial searches. 

The homologous proteins that are found by sequence com- 
parison can be distinguished with high reliability from the huge 
number of unrelated pairs. However, even the best database 
searching procedures tested fail to find the large majority of 
distant evolutionary relationships at an acceptable error rate. 
Thus, if the procedures assessed here fail to find a reliable 
match, it does not imply that the sequence is unique; rather, it 
indicates that any relatives it might have are distant ones.** 



**Additional and updated information about this work, including
supplementary figures, may be found at http://sss.stanford.edu/sss/.



The authors are grateful to Drs. A. G. Murzin, M. Levitt, S. R. Eddy, 
and G. Mitchison for valuable discussion. S.E.B. was principally 
supported by a St. John's College (Cambridge, UK) Benefactors' 
Scholarship and by the American Friends of Cambridge University. 
S.E.B. dedicates his contribution to the memory of Rabbi Albert T. 
and Clara S. Bilgray. 

1. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman,
D. J. (1990) J. Mol. Biol. 215, 403-410.

2. Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480.

3. Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA
85, 2444-2448.

4. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995)
J. Mol. Biol. 247, 536-540.

5. Brenner, S. E., Chothia, C., Hubbard, T. J. P. & Murzin, A. G.
(1996) Methods Enzymol. 266, 635-643.

6. Pearson, W. R. (1991) Genomics 11, 635-650.

7. Pearson, W. R. (1995) Protein Sci. 4, 1145-1160.

8. Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197.

9. George, D. G., Hunt, L. T. & Barker, W. C. (1996) Methods
Enzymol. 266, 41-59.

10. Vogt, G., Etzold, T. & Argos, P. (1995) J. Mol. Biol. 249, 816-831.

11. Henikoff, S. & Henikoff, J. G. (1993) Proteins 17, 49-61.

12. Bairoch, A. & Apweiler, R. (1996) Nucleic Acids Res. 24, 21-25.

13. Bairoch, A., Bucher, P. & Hofmann, K. (1996) Nucleic Acids Res.
24, 189-196.

14. Henikoff, S. & Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. USA
89, 10915-10919.

15. Dayhoff, M., Schwartz, R. M. & Orcutt, B. C. (1978) in Atlas of
Protein Sequence and Structure, ed. Dayhoff, M. (National
Biomedical Research Foundation, Silver Spring, MD), Vol. 5, Suppl.
3, pp. 345-352.

16. Brenner, S. E. (1996) Ph.D. thesis (University of Cambridge,
UK).

17. Sander, C. & Schneider, R. (1991) Proteins 9, 56-68.

18. Johnson, M. S. & Overington, J. P. (1993) J. Mol. Biol. 233,
716-738.

19. Barton, G. J. & Sternberg, M. J. E. (1987) Protein Eng. 1, 89-94.

20. Lesk, A. M., Levitt, M. & Chothia, C. (1986) Protein Eng. 1,
77-78.

21. Arratia, R., Gordon, L. & Waterman, M. (1986) Ann. Stat. 14, 971-993.

22. Karlin, S. & Altschul, S. F. (1990) Proc. Natl. Acad. Sci. USA 87,
2264-2268.

23. Karlin, S. & Altschul, S. F. (1993) Proc. Natl. Acad. Sci. USA 90,
5873-5877.

24. Altschul, S. F., Boguski, M. S., Gish, W. & Wootton, J. C. (1994)
Nat. Genet. 6, 119-129.

25. Pearson, W. R. (1996) Methods Enzymol. 266, 227-258.

26. Lipman, D. J., Wilbur, W. J., Smith, T. F. & Waterman, M. S.
(1984) Nucleic Acids Res. 12, 215-226.

27. Wootton, J. C. & Federhen, S. (1996) Methods Enzymol. 266,
554-571.

28. Waterman, M. S. & Vingron, M. (1994) Stat. Science 9, 367-381.

29. Perutz, M. F., Kendrew, J. C. & Watson, H. C. (1965) J. Mol. Biol.
13, 669-678.

30. Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T. F. &
Weng, J. (1987) in Crystallographic Databases: Information Content,
Software Systems, Scientific Applications, eds. Allen, F. H.,
Bergerhoff, G. & Sievers, R. (Data Comm. Intl. Union Crystallogr.,
Cambridge, UK), pp. 107-132.

31. Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1997) Curr. Opin.
Struct. Biol. 7, 369-376.

32. Orengo, C., Michie, A., Jones, S., Jones, D. T., Swindells, M. B. &
Thornton, J. (1997) Structure (London) 5, 1093-1108.

33. Zweig, M. H. & Campbell, G. (1993) Clin. Chem. 39, 561-577.

34. Gribskov, M. & Robinson, N. L. (1996) Comput. Chem. 20, 25-33.

35. Fitch, W. M. (1966) J. Mol. Biol. 16, 9-16.

36. Chung, S. Y. & Subbiah, S. (1996) Structure (London) 4, 1123-1127.

37. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang,
Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Res. 25,
3389-3402.

38. Girling, R., Schmidt, W., Jr., Houston, T., Amma, E. & Huisman,
T. (1979) J. Mol. Biol. 131, 417-433.

39. Spezio, M., Wilson, D. & Karplus, P. (1993) Biochemistry 32,
9906-9916.

40. Sayle, R. A. & Milner-White, E. J. (1995) Trends Biochem. Sci.
20, 374-376.



Article No. jmbi.1999.2661, available online at http://www.idealibrary.com on IDEAL. J. Mol. Biol. (1999) 288, 147-164






Reference 2 of 4 
in Response dated 04/16/04 

In USSN: 09/828,423 



The Relationship between Protein Structure and 
Function: a Comprehensive Survey with Application 
to the Yeast Genome 

Hedi Hegyi and Mark Gerstein* 



Department of Molecular 
Biophysics & Biochemistry 
Yale University, 266 Whitney 
Avenue, PO Box 208114 
New Haven, CT, 06520 USA 



For most proteins in the genome databases, function is predicted via 
sequence comparison. In spite of the popularity of this approach, the extent 
to which it can be reliably applied is unknown. We address this issue by 
systematically investigating the relationship between protein function and 
structure. We focus initially on enzymes functionally classified by the
Enzyme Commission (EC) and relate these to structurally classified
domains in the SCOP database. We find that the major SCOP fold classes
have different propensities to carry out certain broad categories of
functions. For instance, alpha/beta folds are disproportionately associated with
enzymes, especially transferases and hydrolases, and all-alpha and small
folds with non-enzymes, while alpha + beta folds have an equal tendency
either way. These observations for the database overall are largely true for
specific genomes. We focus, in particular, on yeast, analyzing it with many
classifications in addition to SCOP and EC (i.e. COGs, CATH, MIPS), and
find clear tendencies for fold-function association, across a broad spectrum
of functions. Analysis with the COGs scheme also suggests that the
functions of the most ancient proteins are more evenly distributed among
different structural classes than those of more modern ones. For the
database overall, we identify the most versatile functions, i.e. those that are
associated with the most folds, and the most versatile folds, associated 
with the most functions. The two most versatile enzymatic functions 
(hydro-lyases and O-glycosyl glucosidases) are associated with seven folds 
each. The five most versatile folds (TIM-barrel, Rossmann, ferredoxin, 
alpha-beta hydrolase, and P-loop NTP hydrolase) are all mixed alpha-beta 
structures. They stand out as generic scaffolds, accommodating from six to 
as many as 16 functions (for the exceptional TIM-barrel). At the conclusion 
of our analysis we are able to construct a graph giving the chance that a 
functional annotation can be reliably transferred at different degrees of 
sequence and structural similarity. Supplemental information is available 
from http://bioinfo.mbb.yale.edu/genome/foldfunc. 

© 1999 Academic Press



*Corresponding author



Keywords: structure-function; fold classification; structural convergence; 
functional divergence; yeast genomics 



Introduction 

The problem of determining function 
from sequence 

An ultimate goal of genome analysis is to deter- 
mine the biological function of all the gene pro- 



Abbreviations used: EC, Enzyme Commission; ORF,
open reading frame.

E-mail address of the corresponding author: 
Mark.Gerstein@yale.edu 



ducts in a genome. However, the function of only 
a minor fraction of proteins has been studied 
experimentally, and, typically, prediction of func- 
tion is based on sequence similarity with proteins 
of known function. That is, functional annotation 
is transferred based on similarity. Unfortunately, 
the relationship between sequence similarity and 
functional similarity is not as straightforward. 
This has been commented on in numerous 
reviews (Bork & Koonin, 1998; Karp, 1998). Karp 
(1998), in particular, has noted that transferring 
of incorrect functional information threatens to 



0022-2836/99/160147-18 $30.00/0 









Relationship between Protein Structure and Function 




[Figure 1 image. TIM-barrel active-site labels: 1, oxidoreductase (aldose reductase, 2acs); 3, hydrolase (adenosine deaminase, 1fkw); 4, lyase (enolase, 6enl); 5, isomerase (xylose isomerase, 1dxi).]



Figure 1. Specific example of 
convergent and divergent evol- 
ution. Top, an example of conver- 
gent evolution, showing structures 
of two carbonic anhydrases with 
the same enzymatic function (EC 
number 4.2.1.1), but with different 
folds. The Figure was drawn with 
Molscript (Kraulis, 1991) from 1THJ 
(left-handed beta helix) and 1DMX 
(flat beta sheet). Bottom, an
example of possible divergent evolution,
the TIM-barrel. This fold
functions as a generic scaffold cata- 
lyzing 15 different enzymatic func- 
tions. A schematic Figure of the 
TIM-barrel fold is shown with 
numbers in boxes indicating the 
different location of the active site 
in four proteins that have this fold. 
These four proteins, xylose isomer- 
ase, aldose reductase, enolase, and 
adenosine deaminase, carry out 
very different enzymatic functions, 
in four of the main EC classes 
(1.*.*, 3.*.*, 4.*.*, and 5.*.*). They 
have active sites at very different 
locations (identified by the boxed 
numbers in the barrel) yet they all 
share the same fold. 



progressively corrupt genome databases through 
the problem of accumulating incorrect annota- 
tions and using them as a basis for further anno- 
tations, and so on. 

It is known that sequence similarity does confer 
structural similarity. Moreover, there is a well- 
established quantified relationship between the 
extent of similarity in sequence and that in struc- 
ture. First investigated by Chothia & Lesk (1986) 
the similarity between the structures of two pro- 
teins (in terms of RMS) appears to be a monotonic 
function of their sequence similarity. This fact is 
often exploited when two sequences are declared 
related, based on a database search by programs 
such as BLAST or FastA (Altschul et al, 1997; 
Pearson, 1996). Often, the only common element in 
two distantly related protein sequences is their 
underlying structures, or folds. 

Transitivity requires that the well-established 
relationship between sequence and structure, and 
the more indefinite one between sequence and 
function, imply an indefinite relationship between 
structure and function. Several recent papers have 
highlighted this, analyzing individual protein 
superfamilies with a single fold but diverse func- 
tions. Examples include the aldo-keto reductases, a 
large hydrolase superfamily, and the thiol protein 



esterases. The latter include the eye-lens and cor- 
neal crystallins, a remarkable example of functional 
divergence (Bork & Eisenberg, 1998; Bork et al, 
1994; Cooper et al, 1993; Koonin & Tatusov, 1994; 
Seery et al, 1998). 

There are also many classic examples of the con- 
verse: the same function achieved by proteins with 
completely different folds. For instance, even 
though mammalian chymotrypsin and bacterial 
subtilisin have different folds, they both function 
as serine proteases and have the same Ser-Asp-His 
catalytic triad. Other examples include sugar 
kinases, anti-freeze glycoproteins, and lysyl-tRNA 
synthetases (Bork et al, 1993; Chen et al, 1997; 
Doolittle, 1994; Ibba et al, 1997a,b). 

Figure 1 shows well-known examples of each of 
these two basic situations: the same fold but differ- 
ent function (divergent evolution) and the same 
function but different fold (convergent evolution). 

Protein classification systems 

The rapid growth in the number of protein 
sequences and three-dimensional structures has 
made it practical and advantageous to classify pro- 
teins into families and more elaborate hierarchical 
systems. Proteins are grouped together on the 






basis of structural similarities in the FSSP (Holm & 
Sander, 1998), CATH (Orengo et al, 1997), and 
SCOP databases (Murzin et al, 1995). SCOP is
based on the judgments of a human expert, FSSP
on automatic methods, and CATH on a mixture of
both. Other databases collect proteins on the basis
of sequence similarities to one another, e.g. PRO- 
SITE, SBASE, Pfam, BLOCKS, PRINTS and Pro- 
Dom (Attwood et al, 1998; Bairoch et al, 1997; 
Corpet et al, 1998; Fabian et al, 1997; Henikoff 
et al., 1998; Sonnhammer et al., 1997). Several col- 
lections contain information about proteins from a 
functional point of view. Some of these focus on 
particular organisms, e.g. the MIPS functional cata- 
logue and YPD for yeast (Mewes et al, 1997; 
Hodges et al, 1998) and EcoCyc and GenProtEC 
for Escherichia coli (Karp et al, 1998; Riley, 1997). 
Others focus on particular functional aspects in 
multiple organisms, e.g. the WIT and KEGG 
databases, which focus on metabolism and path- 
ways (Selkov et al, 1997; Ogata et al, 1999), the 
ENZYME database, which focuses obviously 
enough on enzymes (Bairoch, 1996), and the 
COGs system, which focuses on proteins con- 
served over phylogenetically distinct species 
(Tatusov et al, 1997). The ENZYME database, in 
particular, contains all the enzyme reactions that 
have an Enzyme Commission (EC) number 
assigned in accordance with the International 
Nomenclature Committee and is cross-referenced 
with Swissprot (Bairoch, 1996; Bairoch & 
Apweiler, 1998; Barrett, 1997). 

Our approach: systematic comparison of 
proteins classified by structure with those 
classified by function 

One of the most valuable operations one can 
do to these individual classification systems is to 
cross-reference and cross-tabulate them, seeing 
how they overlap. We performed such an anal- 
ysis here by systematically interrelating the 
SCOP, Swissprot and ENZYME databases 
(Bairoch, 1996; Bairoch & Apweiler, 1998; Murzin 
et al, 1995). For yeast we also have used the 
MIPS yeast functional catalogue, CATH and 
COGs in our analysis. This enables us to investi- 
gate the relationship between protein function 
and structure in a comprehensive statistical 
fashion. In particular, we investigated the func- 
tional aspects of both divergent and convergent 
evolution, exploring cases where a structure gains 
a dramatically different biochemical function and 
finding instances of similar enzymatic functions 
performed by unrelated structures. 

We concentrated on single-domain Swissprot 
proteins with significant sequence similarity to one 
of the SCOP structural domains. Since most of 
these proteins have a single assigned function, 
comparing them to individual structural domains, 
which can have only one assigned fold, allowed us 
to establish a one-to-one relationship between 
structure and function. 



Recent related work 

This work is following up on several recent 
reports on the relationship between protein struc- 
ture and function. In particular, Martin et al (1998) 
studied the relationship between enzyme function 
and the CATH fold classification. They concluded 
that functional class (expressed by top-level EC 
numbers) is not related to fold, since a few specific 
residues, not the whole fold, determine enzyme 
function. Russell (1998) also focused on specific 
side-chain patterns, arguing that these could be 
used to predict protein function. In a similar 
fashion, Russell et al (1998) identified structurally 
similar "supersites" in superfolds. They estimated 
that the proportion of homologues with different 
binding sites, and therefore with different func- 
tions, is around 10%. In a novel approach, using 
machine learning techniques, des Jardins et al 
(1997) predict purely from the sequence whether a 
given protein is an enzyme and also the enzyme 
class to which it belongs. 

Our work is also motivated by recent work look- 
ing at whether or not organisms are characterized 
by unique protein folds (Frishman & Mewes, 1997; 
Gerstein, 1997, 1998a,b; Gerstein & Hegyi, 1998; 
Gerstein & Levitt, 1997). If function is closely 
associated with fold (in a one-to-one sense), one 
would think that when a new function arose in 
evolution, nature would have to invent a new fold. 
Conversely, if fold and function are only weakly 
coupled, one would expect to see a more uniform 
distribution of folds amongst organisms and a 
high incidence of convergent evolution. In fact, a 
recent study on microbial genome analysis claims 
that functional convergence is quite common 
(Koonin & Galperin, 1997). Another related paper 
systematically searched Swissprot for all such cases 
of what is termed "analogous" enzymes (Galperin 
et al, 1998). 

Our work is also motivated by the recent work 
on protein design and engineering which aims to 
rationally change a protein function, for instance, 
to engineer a reporter function into a binding pro- 
tein (Hellinga, 1997, 1998; Marvin et al, 1997). 

Results 

Overview of the 8937 single-domain matches 

Our basic results were based on simple sequence 
comparisons between Swissprot and SCOP, the 
SCOP domain sequences being used as queries 
against Swissprot. We focused on "mono-functional"
single-domain matches in Swissprot, i.e.
those single-domain proteins with only one annotated
function. The detailed criteria used in the
database searches are summarized in Materials 
and Methods. 

Overall, a little more than a quarter of the pro- 
teins in Swissprot are enzymes, a similar fraction 
are of known structure, and about one-eighth are 
both. (More precisely, of the 69,113 analyzed proteins
in Swissprot, 19,995 are enzymes, 18,317 are
structural homologues, and 8205 are both.) About 
half of the fraction of Swissprot that matched 
known structures were "single-domain" and about 
one-third of these were enzymes (8937 and 3359, 
respectively, of 18,317). We focus on these 8937 
single-domain matches here. Notice how these 
numbers also show how the known structures are 
significantly biased towards enzymes: 45% (8205 
out of 18,317) of all the structural homologues are 
enzymes versus 29% (19,995 out of 69,113) for all 
of Swissprot. 
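
The proportions quoted in this overview follow directly from the counts given in the text, as the short check below shows. The dictionary keys and the helper name are illustrative only.

```python
# Counts taken from the paragraph above (Swissprot entries)
counts = {
    "swissprot": 69113,               # all analyzed proteins
    "enzymes": 19995,                 # proteins annotated as enzymes
    "structural_homologues": 18317,   # proteins matching a known structure
    "both": 8205,                     # enzymes that also match a known structure
    "single_domain": 8937,            # single-domain structural matches
    "single_domain_enzymes": 3359,    # single-domain matches that are enzymes
}


def percent(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


# Enzyme share among structural homologues versus all of Swissprot
print(round(percent(counts["both"], counts["structural_homologues"])))
print(round(percent(counts["enzymes"], counts["swissprot"])))
# Share of structural homologues that are single-domain, and enzymes among those
print(round(percent(counts["single_domain"], counts["structural_homologues"])))
print(round(percent(counts["single_domain_enzymes"], counts["single_domain"])))
```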

331 observed fold-function combinations 

Figure 2 gives an overview of how the matches 
are distributed amongst specific functions and 
folds. The single-domain matches include 229 of 
the 361 folds in SCOP 1.35, and 91 of the 207 three- 
component enzyme categories in the ENZYME 
database (Bairoch, 1996). Each match combines a 
SCOP fold number on the structural side (columns 
in Figure 2) and a three-component EC category on 
the functional side (rows), with all the non-enzy- 
matic functions grouped together into a single cat- 
egory with the artificial "EC number" of 0.0.0 
(shown in the first row in Figure 2). This results in 
a table where each cell represents a potential fold- 
function combination. The table contains a maxi- 



mum of 21,068 (= 229 x 92) possible fold-function 
combinations (and a minimum of 229 combi- 
nations, assuming only one function for every 
fold). We actually observe 331 of these combi- 
nations (1.6%, shown by the filled-in cells). 

Overall, more than half of the functions are 
associated with at least two different folds, while 
less than half of the folds with enzymatic activity 
have at least two functions (51 out of 91 and 53 out 
of 128, respectively). 
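
The cross-tabulation described above can be sketched in a few lines. This is a minimal illustration, assuming each match has already been reduced to a (fold, three-component EC category) pair; the fold identifiers and toy matches are hypothetical and are not taken from SCOP or Swissprot.

```python
from collections import defaultdict


def tabulate(matches):
    """Collapse (fold, function) matches into the set of observed combinations
    and the two lookup tables used to count multi-fold functions and multi-function folds."""
    observed = set(matches)  # duplicate matches collapse into one combination
    folds_by_function = defaultdict(set)
    functions_by_fold = defaultdict(set)
    for fold, function in observed:
        folds_by_function[function].add(fold)
        functions_by_fold[fold].add(function)
    return observed, folds_by_function, functions_by_fold


# Hypothetical matches; "0.0.0" stands for the non-enzyme category
toy_matches = [("fold_a", "4.2.1"), ("fold_b", "4.2.1"), ("fold_a", "5.3.1"),
               ("fold_c", "0.0.0"), ("fold_a", "4.2.1")]
observed, folds_by_function, functions_by_fold = tabulate(toy_matches)

possible = len(functions_by_fold) * len(folds_by_function)
multi_fold_functions = sum(1 for f in folds_by_function.values() if len(f) >= 2)
multi_function_folds = sum(1 for f in functions_by_fold.values() if len(f) >= 2)
print(len(observed), possible, multi_fold_functions, multi_function_folds)

# The figures in the text: 229 folds x 92 functions, of which 331 are observed
print(229 * 92, round(100 * 331 / (229 * 92), 1))
```

Counting combinations rather than individual matches means that a heavily sequenced family contributes one cell to the table, however many sequences it contains.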

Summarizing the fold-function combinations 
by 42 broad structure-function classes 

As listed in Table 1, folds can be subdivided in 
six broad fold classes (e.g. all-alpha, all-beta, 
alpha/beta, etc.). Likewise, functions can be bro- 
ken into seven main classes, non-enzymes plus six 
enzyme classes, e.g. oxidoreductase, transferase, 
etc. This gives rise to 42 (6 x 7) structure-function 
classes. The way the 21,068 potential fold-function 
combinations are apportioned amongst the 42 
classes is shown in Table 2A. 
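
The grouping into 42 structure-function classes can be sketched as a simple lookup. The abbreviations follow Table 1; the function name and the list layout are illustrative assumptions, and the transmembrane class is left out because Table 1 notes that it is not used in the analysis.

```python
from itertools import product

# First EC component -> functional class abbreviation (Table 1A); "0" is the non-enzyme category
EC_CLASS = {"0": "NONENZ", "1": "OX", "2": "TRAN", "3": "HYD",
            "4": "LY", "5": "ISO", "6": "LIG"}

# Fold classes used in the analysis (Table 1B, without the transmembrane class)
FOLD_CLASSES = ["A", "B", "A/B", "A+B", "MULTI", "SML"]


def functional_class(ec_number: str) -> str:
    """Map an EC number such as '4.2.1.1' to its broad functional class."""
    return EC_CLASS[ec_number.split(".")[0]]


structure_function_classes = list(product(FOLD_CLASSES, EC_CLASS.values()))
print(len(structure_function_classes))
print(functional_class("4.2.1.1"))
```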

Table 2B shows the way the 331 observed combi- 
nations were actually distributed amongst the 42 
classes. Comparing the number of possible combi- 
nations with that observed shows that the most 
densely populated region of the chart is the trans- 
ferase, hydrolase and lyase functions in combi- 



[Figure 2 image. Columns: 229 folds, grouped by fold class (labels recovered: B, A/B, A+B). Rows: functions, grouped as NONENZ, OX, TRAN, HYD, LY, ISO, LIG.]



Figure 2. Overview of all the single-domain matches between proteins in Swissprot 35 and domains in SCOP 1.35. 
Sequences were compared with BLAST using the match criteria described in Materials and Methods. The matches are 
clustered into 92 functions (based on three-component EC numbers), which are arranged on each row, and 229 folds 
(based on SCOP fold numbers), which are arranged on each column. The first row indicates the matches with non- 
enzymes. There are, thus, 21,068 (= 92 x 229) possible combinations shown in the Figure. Only 331 are actually
observed. These are indicated by filled squares.






Table 1. Broad structural and functional categories

A. Functional categories in Swissprot 35 (a)

EC category   Category name     Abbreviation   Num. of functions in category
0.0.0         Non-enzymes       NONENZ         1
1.*.*         Oxidoreductases   OX             86
2.*.*         Transferases      TRAN           28
3.*.*         Hydrolases        HYD            53
4.*.*         Lyases            LY             15
5.*.*         Isomerases        ISO            16
6.*.*         Ligases           LIG            9
Total:                                         208

B. Structural classes in SCOP 1.35 (b)

Fold class    Class name        Abbreviation   Num. of folds in class
1             All-alpha         A              81
2             All-beta          B              57
3             Alpha and beta    A/B            70
4             Alpha plus beta   A+B            91
5             Multi-domain      MULTI          19
6             Transmembrane     TM             9
7             Small proteins    SML            43
Total:                                         361

(a) List of the functional (enzymatic) categories in Swissprot and the abbreviations used here. The values denote the number of
three-component EC numbers in each category.
(b) List of the structural classes in SCOP studied here, and the abbreviations used for the classes. Values denote the number of folds
in each class in SCOP 1.35. Class 6 is not used in the analysis.



nation with the alpha /beta fold class. This notion 
is in accordance with the general view that the 
most popular structures among enzymes fall into 
the alpha/beta class. In contrast, matches between 
small folds and enzymes are almost completely 
missing, except for five folds in the oxidoreductase 
category. There are also no all-alpha ligases and 
only one all-alpha isomerase. 

Table 2C and D break down the 331 fold-function combinations in Table 2B into either just a
number of folds or just a number of functions. That is, Table 2C lists the number of different folds
associated with each of the 42 structure-function classes (corresponding to the non-zero columns in
the relevant class in Figure 2), and Table 2D does the same thing for functions (non-zero rows in
Figure 2). Comparing these tables back to the total number of observed combinations (Table 2B)
reveals some interesting findings, keeping in mind that more functions than folds suggests probable
divergence and that more folds than functions suggests probable convergence. For instance, the
alpha/beta and alpha + beta fold classes contain similar numbers of folds, but the alpha/beta class
has relatively more functions, perhaps reflecting a greater divergence. (Specifically, the alpha/beta
class has 73 folds and 56 functions, while the alpha + beta class has 67 folds but only 35 functions.)
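The comparison above can be made concrete with the "Sum" rows of Table 2C and D. The following is a minimal sketch that divides the number of functions by the number of folds for each fold class. This ratio is only an illustrative summary introduced here, and the name `functions_per_fold` is hypothetical; it is not a statistic used in the analysis itself.

```python
# "Sum" rows of Table 2C (folds) and Table 2D (functions), by fold class.
FOLDS = {"A": 51, "B": 51, "A/B": 73, "A+B": 67, "MULTI": 18, "SML": 29}
FUNCTIONS = {"A": 18, "B": 21, "A/B": 56, "A+B": 35, "MULTI": 14, "SML": 6}


def functions_per_fold(fold_class):
    """Number of functions divided by number of folds in one fold class."""
    return FUNCTIONS[fold_class] / FOLDS[fold_class]


for name in FOLDS:
    print(f"{name:6s} {functions_per_fold(name):.2f}")
```

A larger ratio for alpha/beta than for alpha + beta is the pattern the text interprets as possibly greater divergence.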

Table 2E shows the number of matching Swissprot sequences (from the total of 69,113) for each
of the 42 structure-function classes. The most highly populated categories are the all-alpha
non-enzymes, where 683 of the 1940 matches come from globins, and the all-beta non-enzymes, where
361 of the 1159 Swissprot sequences have matches with the immunoglobulin fold. These numbers are,
obviously, affected by the biases in Swissprot. On the other hand, if we compare the total matches in
Table 2E with the total combinations in Table 2B it is clear that the numbers do not directly correlate.
For instance, fewer hydrolases in Swissprot have matches with alpha/beta folds than with alpha + beta
folds (295 versus 452), but the number of different combinations in the first case is 30, as
opposed to only 18 in the second case. This suggests that our approach of counting combinations
may not be as affected by the biases in the databanks as simply counting matches.

Table 2F and G give some rough indication of the statistical significance of the differences in
the observed distribution of combinations. In Table 2F, using chi-squared statistics, we calculate
for each individual structure class the chance that we could get the observed distribution of
fold-function combinations over various functional classes if fold was not related to function. Then
in Table 2G, we reverse the role of fold and function, and calculate the statistics for each
functional class.

Enzyme versus non-enzyme folds 

On the coarsest level, function can be divided between enzymes and non-enzymes. Of the 229
folds present in Figure 2, 93 are associated only with enzymes and 101 are associated only with
non-enzymes. The remaining 35 folds are associated with both enzymatic and non-enzymatic activity.
Finally, of the 93 purely enzymatic folds, 18 have multiple enzymatic functions.
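As a minimal sketch of how such a three-way split could be derived from a list of fold-function combinations, the code below labels each fold as purely enzymatic, purely non-enzymatic, or both. The function names and the toy pairs are hypothetical and are not taken from Figure 2; the only conventions borrowed from the paper are the use of the artificial EC number 0.0.0 for non-enzymes and the counts quoted in the text.

```python
from collections import defaultdict

NON_ENZYME = "0.0.0"  # artificial EC number used for non-enzymes


def classify_folds(pairs):
    """Label each fold 'ENZ', 'nonENZ' or 'Both' from (fold, EC category) pairs."""
    kinds = defaultdict(set)
    for fold, ec in pairs:
        kinds[fold].add("nonENZ" if ec == NON_ENZYME else "ENZ")
    return {fold: "Both" if len(k) == 2 else next(iter(k))
            for fold, k in kinds.items()}


def multifunctional_enzyme_folds(pairs):
    """Purely enzymatic folds associated with more than one enzymatic function."""
    labels = classify_folds(pairs)
    functions = defaultdict(set)
    for fold, ec in pairs:
        functions[fold].add(ec)
    return sorted(fold for fold, label in labels.items()
                  if label == "ENZ" and len(functions[fold]) > 1)


# Hypothetical toy input (fold names and EC categories are illustrative only).
toy_pairs = [
    ("fold_1", "1.14.14"), ("fold_1", "1.14.15"),  # two enzymatic functions
    ("fold_2", "0.0.0"), ("fold_2", "1.18.6"),     # enzymatic and non-enzymatic
    ("fold_3", "0.0.0"),                           # non-enzymatic only
    ("fold_4", "3.2.1"),                           # one enzymatic function
]
labels = classify_folds(toy_pairs)
print(labels)
print(multifunctional_enzyme_folds(toy_pairs))

# The counts quoted in the text fix the size of the mixed group.
mixed = 229 - 93 - 101
print(mixed)
```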

Figure 3(a) shows a graphical view of the distribution of the different fold classes among these
broadest functional categories. The distribution is far from uniform. The all-alpha fold class has 30
non-enzymatic representatives, but only 12 purely enzymatic folds and four folds with "mixed" (both
types of) functions. This implies that a protein with an all-alpha fold has a priori roughly twice the
chance of having a non-enzymatic function over an enzymatic one. The all-beta fold class has six
enzymatic, 17 non-enzymatic and 13 mixed folds. In the alpha/beta class, 34 folds are associated only
with enzymes and five folds only with non-enzymes, whereas in the alpha + beta class this ratio is more
balanced, 28 "purely" enzymatic folds versus 22 purely non-enzymatic ones.
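The "roughly twice" figure can be checked against the fold counts tabulated with Figure 3(a). The sketch below is one possible reading, not necessarily the calculation the authors had in mind: it computes the ratio of non-enzymatic to enzymatic folds in a class, either counting the mixed folds on both sides or leaving them out. The helper name and the `count_mixed` switch are assumptions made for illustration.

```python
# Fold counts from the table accompanying Figure 3(a) (all of Swissprot).
FIG3A = {
    "A":     {"Both": 4,  "ENZ": 12, "nonENZ": 30},
    "B":     {"Both": 13, "ENZ": 6,  "nonENZ": 17},
    "A/B":   {"Both": 9,  "ENZ": 34, "nonENZ": 5},
    "A+B":   {"Both": 6,  "ENZ": 28, "nonENZ": 22},
    "MULTI": {"Both": 2,  "ENZ": 11, "nonENZ": 2},
    "SML":   {"Both": 1,  "ENZ": 2,  "nonENZ": 25},
}


def nonenzyme_to_enzyme_ratio(fold_class, count_mixed=True):
    """Ratio of folds with non-enzymatic function to folds with enzymatic function.

    With count_mixed=True the 'Both' folds are added to numerator and denominator.
    """
    counts = FIG3A[fold_class]
    extra = counts["Both"] if count_mixed else 0
    return (counts["nonENZ"] + extra) / (counts["ENZ"] + extra)


for name in FIG3A:
    print(f"{name:6s} {nonenzyme_to_enzyme_ratio(name):.2f}")
```

For the all-alpha class both readings give a ratio a little above two, while the alpha/beta class comes out well below one.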



Table 2. Statistics over 42 structure-function classes 



A. Number of possible combinations between folds and functions in each of 42 classes (number of cells in Figure 2)

          A      B      A/B    A+B    MULTI  SML    Sum
NONENZ    46     36     48     56     15     28     229
OX        1104   864    1152   1344   360    672    5496
TRAN      598    468    624    728    195    364    2977
HYD       1334   1044   1392   1624   435    812    6641
LY        414    324    432    504    135    252    2061
ISO       460    360    480    560    150    280    2290
LIG       276    216    288    336    90     168    1374
Sum       4232   3312   4416   5152   1380   2576   21,068


B. Number of observed combinations between folds and functions in each of 42 classes (number of filled cells in Figure 2)

          A      B      A/B    A+B    MULTI  SML    Sum
NONENZ    34     30     14     28     4      26     136
OX        13     5      17     3      4      5      47
TRAN      3      3      16     8      5      -      35
HYD       4      11     30     18     4      -      67
LY        2      3      13     5      -      -      23
ISO       1      2      7      4      2      -      16
LIG       -      1      2      3      1      -      7
Sum       57     55     99     69     20     31     331


C. Number of folds in each of the 42 classes (columns with a filled cell in Figure 2)

          A      B      A/B    A+B    MULTI  SML    Sum
NONENZ    34     30     14     28     4      26     136
OX        7      5      9      3      3      3      30
TRAN      3      2      15     6      5      -      31
HYD       4      8      19     18     3      -      52
LY        2      3      8      5      -      -      18
ISO       1      2      7      4      2      -      16
LIG       -      1      1      3      1      -      6
Sum       51     51     73     67     18     29     289


D. Number of functions in each of the 42 classes (rows with a filled cell in Figure 2)

          A      B      A/B    A+B    MULTI  SML    Sum
NONENZ    1      1      1      1      1      1      6
OX        8      5      9      3      3      5      33
TRAN      2      3      13     8      4      -      30
HYD       4      7      19     14     4      -      48
LY        2      2      7      3      -      -      14
ISO       1      2      5      4      1      -      13
LIG       -      1      2      2      1      -      6
Sum       18     21     56     35     14     6      150


E. Total number of matching Swissprot sequences in each of the 42 fold-function classes

          A      B      A/B    A+B    MULTI  SML    Sum
NONENZ    1940   1159   560    638    106    892    5295
OX        150    202    388    50     68     18     876
TRAN      65     14     363    116    174    -      732
HYD       116    394    295    452    92     -      1349
LY        40     47     168    104    -      -      359
ISO       2      54     122    22     2      -      202
LIG       -      5      26     69     24     -      124
Sum       2313   1875   1922   1451   466    910    8937


F. How much does each of the fold classes deviate from the average distribution of functions?

          χ²     P
A         17.5   <0.01
B         5.2    <0.6
A/B       32.5   <0.00002
A+B       7.7    <0.3
MULTI     9.9    <0.2
SML       27.8   <0.0002
















G. How much do each of the function classes deviate from the average distribution of folds?

          χ²     P
NONENZ    40.7   <0.0000002
OX        9.9    <0.08
TRAN      13.1   <0.03
HYD       17.3   <0.005
LY        10.2   <0.08
ISO       5.0    <0.5
LIG       4.3    <0.6



This Table shows various totals from Figure 2 distributed among the 42 structure-function classes, i.e. the seven functional
categories in Table 1A multiplied by the six structural categories in Table 1B. Part A shows how many potential fold-function
combinations there are in Figure 2 amongst each of the 42 classes. Part B shows how many of these 21,068 possible combinations are
actually observed. Part C shows the total number of different folds (i.e. selected columns in Figure 2) in each class. Part D shows the
total number of different functions (i.e. selected rows in Figure 2) in each class. Part E shows the total number of matching Swissprot
proteins in the 42 classes. Note that to observe a fold-function combination one only needs the existence of a single match between a
Swissprot protein and a SCOP domain. However, there can be many more. That is why the totals in Part E sum up to so much
larger an amount than 331.

Here is an example of how to read parts A to E of the Table, focussing on the all-alpha, oxidoreductase region. Part A shows that
there are 1104 cells, filled or unfilled, in this region, corresponding to possible combinations. Part B shows that 13 of these 1104 cells
are filled, corresponding to observed all-alpha, oxidoreductase combinations. Part C shows that there are seven folds, corresponding
to columns with filled cells in this region. Part D shows that there are eight functions, corresponding to rows with filled cells in this
region. Finally, in part E we find that there are 150 Swissprot entries that have matches with a SCOP domain. They correspond to
the 13 observed combinations in Part B.

Parts F and G give information on the statistical significance of the differences observed between the 42 structure-function classes.
Part G gives the significance that the observed distribution of fold-function combinations in a given functional class is different than
average (i.e. it tests the null hypothesis that the distribution of fold-function combinations is the same in each functional class). This
is very similar to the derivation by Martin et al. (1998). A chi-squared statistic is computed for each of the seven functional classes in
the conventional way: $\chi^2(f) = \sum_s (O_{sf} - E_{sf})^2 / E_{sf}$, where for a given functional class $f$ and structure class $s$,
$O_{sf}$ is the observed number of fold-function combinations and $E_{sf}$ is the expected number. $E_{sf}$ is simply computed from
scaling the "sum" column and row in Part B of the Table: $E_{sf} = T_s T_f / T$, where $T_s$ is the total number of combinations in a
given structural class $s$ (sum row), $T_f$ is the total number of combinations in a given functional class $f$ (sum column), and $T$ is
the total observed number of combinations, 331. Part F gives the statistical significance that the observed distribution of fold-function
combinations in a given structural class is different than average. To compute this one simply sums over functions instead of
structures: $\chi^2(s) = \sum_f (O_{sf} - E_{sf})^2 / E_{sf}$. After each chi-squared statistic is reported, a rough probability or P-value is
given. This gives the chance the observed distribution could be obtained randomly.
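The chi-squared recipe in this legend can be sketched directly from the observed counts in Part B, treating empty cells as zero (an assumption about how blanks were handled). The function and variable names below are illustrative. P-values are not computed here, since that would need a chi-squared distribution function from an external library.

```python
# Observed fold-function combinations, Part B of Table 2 (empty cells taken as 0).
STRUCT = ["A", "B", "A/B", "A+B", "MULTI", "SML"]
OBSERVED = {
    "NONENZ": [34, 30, 14, 28, 4, 26],
    "OX":     [13, 5, 17, 3, 4, 5],
    "TRAN":   [3, 3, 16, 8, 5, 0],
    "HYD":    [4, 11, 30, 18, 4, 0],
    "LY":     [2, 3, 13, 5, 0, 0],
    "ISO":    [1, 2, 7, 4, 2, 0],
    "LIG":    [0, 1, 2, 3, 1, 0],
}

T = sum(sum(row) for row in OBSERVED.values())       # total observed combinations
T_f = {f: sum(row) for f, row in OBSERVED.items()}   # "sum" column
T_s = {s: sum(row[j] for row in OBSERVED.values())   # "sum" row
       for j, s in enumerate(STRUCT)}


def observed(s, f):
    return OBSERVED[f][STRUCT.index(s)]


def expected(s, f):
    """E_sf = T_s * T_f / T."""
    return T_s[s] * T_f[f] / T


def chi2_for_function(f):
    """Sum over structure classes for one functional class."""
    return sum((observed(s, f) - expected(s, f)) ** 2 / expected(s, f)
               for s in STRUCT)


def chi2_for_structure(s):
    """Sum over functional classes for one structure class."""
    return sum((observed(s, f) - expected(s, f)) ** 2 / expected(s, f)
               for f in OBSERVED)


for f in OBSERVED:
    print(f"{f:7s} {chi2_for_function(f):5.1f}")
for s in STRUCT:
    print(f"{s:7s} {chi2_for_structure(s):5.1f}")
```

Summing over structure classes for the non-enzymes gives about 40.7, and summing over functional classes for the all-alpha class gives about 17.5, in line with the values listed under the function-class and fold-class headings of the Table.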



Restricting the comparison to 
individual genomes 

Figure 3(a) applies to all of Swissprot. Figure 3(b) and (c) show the functional distribution of folds
taking into account the matches only in two specific genomes, yeast and E. coli. Only a fraction
of each genome could be taken into consideration for various reasons (156 proteins in yeast, 244
proteins in E. coli), mostly due to the great number of enzymes having multiple domains in both yeast
and E. coli. Chi-squared tests show that the fold distribution in yeast does not differ significantly
from that in Swissprot and that the one in E. coli differs only slightly (P < 0.25 and P < 0.02,
respectively). The main difference between Swissprot and E. coli is the larger fraction of alpha/beta
enzymatic folds in the latter (34/93 versus 26/49). There are also somewhat more non-enzymatic
all-alpha and small folds in Swissprot than in the two genomes. This is principally due to the greater
prevalence of globins, myosins, cytochromes, toxins, and hormones in Swissprot than in yeast and
E. coli. Many of these, of course, are proteins usually associated with multicellular organisms. We did
a preliminary version of the fold distribution for the worm Caenorhabditis elegans. As expected this
distribution turns out to be similar to that of Swissprot (data not shown).



The yeast genome viewed from different 
classification schemes 

In Figure 4 we focus on the yeast genome in more detail, trying to see the effect that different
classification schemes have on our results. Although the total number of counts for our statistics
decreases in just using yeast relative to all of Swissprot, yeast provides a good reference frame
to compare a number of classification schemes in as unbiased a fashion as possible. Also, yeast is
one of the most comprehensively characterized organisms, and there are a number of functional
classifications available exclusively for this organism.

In Figure 4(a) we cross-tabulate the structure-function combinations in yeast using the
SCOP and EC systems as we have done for all of Swissprot in Table 2B. The yeast distribution is
fairly similar to that of Swissprot with the only major difference being somewhat more alpha/beta
transferases and fewer alpha/beta hydrolases than expected. (A chi-squared test gives P < ~0.05 for
the two distributions to differ. If either the transferase or hydrolase difference is removed, P increases
to ~20%.)

Figure 4(b) shows the structure-function combinations based on using the CATH structural
classification (Orengo et al., 1997) instead of SCOP. For this Figure we mapped the SCOP
classification of a yeast PDB match to its corresponding CATH classification and then cross-tabulated
the structure-function combinations in the various classes. Essentially, this Figure shows the results
reported by Martin et al. (1998) just for yeast.

In Figure 4(c) and (d), which show COGs versus SCOP cross-tabulations, we achieve the opposite of
(b). We change the functional classification scheme but keep SCOP for classifying structures.
As was the case with the enzyme classification, but perhaps even more so, using COGs to classify
function shows clearly that certain fold classes are associated with certain functions and vice versa.
Most notably, whereas the functions associated with metabolism, which are mostly enzymes, are
preferentially associated with the alpha/beta fold class, those associated with cellular processes (e.g.
secretion) and information processing (e.g. transcription), show no such preference. They, in fact,
show a marked preference for all-alpha structure. Small proteins are absent from most of the COGs
classes, except one part of information processing and two in cellular processes.

The COGs system classifies functions for those proteins that have clear orthologues in different
species. Thus, conclusions based on using yeast COGs should be readily applicable to other
genomes. This point is highlighted in Figure 4(d), which shows a COGs versus SCOP classification
for only the 110 COGs that are conserved across all the analyzed genomes (eight) and all three
kingdoms. Thus, this sub-figure would appear exactly the same for E. coli, Methanococcus jannaschii
or a number of other genomes. It clearly shows how much more common the information processing
proteins are among the most conserved and ancient proteins. Moreover, note how these most
ancient proteins appear to have less of a preference for a particular structural class than the "more
modern" metabolic ones. This suggests that large-scale duplication of alpha/beta folds for use in
metabolism is what gave rise to stronger fold-function association in Figure 3(c).

A. All of Swissprot

Number of folds in the different functional categories

          A      B      A/B    A+B    MULTI  SML    TOTAL
Both      4      13     9      6      2      1      35
ENZ       12     6      34     28     11     2      93
nonENZ    30     17     5      22     2      25     101

B. Yeast

Number of folds in the different functional categories

          A      B      A/B    A+B    MULTI  SML    TOTAL
Both      0      1      3      0      0      0      4
ENZ       6      4      13     8      3      1      35
nonENZ    6      5      1      7      0      1      20

C. E. coli

Number of folds in the different functional categories

          A      B      A/B    A+B    MULTI  SML    TOTAL
Both      1      2      3      3      1      0      10
ENZ       4      5      26     10     4      0      49
nonENZ    10     5      4      7      0      1      27

Figure 3. Chart with breakdown among structure-function classes in two genomes. Charts and
Tables showing the number of folds in each fold class associated with only enzymatic (ENZ), only
non-enzymatic (nonENZ), and both enzymatic and non-enzymatic functions (Both). The results are
shown for (a) all of Swissprot, (b) for just the yeast genome, and (c) for just the E. coli genome. The
results for individual domains in a minimum set of SCOP domains also support these tendencies
(data not shown). The numbers in (b) are not based on the PSI-blast protocol used for Figure 4.
Rather they are found just as "subsets" of the overall Swissprot results to make them readily
comparable with the rest of the paper. Because of this the numbers in this Figure will not match
exactly those in Figure 4, the difference having to do with the greater number of fold-function
combinations found by PSI-blast as compared to WU-blast.



[Figure 4, panels (a) to (e): grids cross-tabulating structural classes (columns) against functional classes (rows), with cell values given as percentages; the grids are not recoverable from the scan.]



Figure 4. Structure-function classes in the yeast genome analyzed through a variety of classification schemes. This
Figure shows the distribution of fold-function combinations in the yeast genome as analyzed by a variety of different
structure and functional classifications. Each of the Figures is a cross-tabulation of one structural classification scheme
(on the column heads) versus a functional classification (row heads), (a) SCOP versus ENZYME; (b) CATH versus
ENZYME; (c) SCOP versus COGs; (d) SCOP versus Most Conserved COGs; (e) SCOP versus MIPS Functional Catalogue.
Each of the grid boxes gives the number of fold-function combinations within a structure-function class. This
number is expressed as a percentage of the total number of combinations in the diagram to make the graphs readily
comparable. The total number of combinations in each of the sub-figures is (a) 141, (b) 77, (c) 1207, (d) 120, and (e)
66. (a) and (e) are directly comparable with the cross tabulation in Table 2B for all of Swissprot. In (d) and (e), we
employ the COGs scheme in exactly the same fashion as we did the ENZYME classification. We form combinations
between individual yeast COGs and SCOP folds (e.g. COG 0186 with fold 2.26) and then we place these combinations
into larger structure-function classes. The COGs overall functional classes are denoted by a single letter and
then are in turn grouped into three broader areas (so, for instance, the 0186-2.26 pair would go into the structure-function
class all-beta, J). We, likewise, proceed similarly for the MIPS yeast functional catalogue. This assigns to each
function a two or three component number similar to an EC number (e.g. 07.20.3 or 06.2). We use the first two numbers
to create combinations with SCOP folds and then use the top number to create the functional classes shown in
the diagram. For (e) we just use the 110 COGs that are present in all eight genomes in the current COGs analysis
(E. coli, H. influenzae, H. pylori, M. genitalium, M. pneumoniae, Synechocystis, M. jannaschii, and yeast).
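The grouping steps described in this legend can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the class names follow Table 1B, and the COG-to-letter dictionary contains only the single pair quoted in the legend (COG 0186, class J).

```python
# SCOP class number (first component of a fold number) to class name, after Table 1B.
SCOP_CLASS = {
    "1": "all-alpha", "2": "all-beta", "3": "alpha/beta",
    "4": "alpha+beta", "5": "multi-domain", "7": "small",
}

# Only the example given in the legend; a full mapping would come from the COGs data.
COG_LETTER = {"0186": "J"}


def structure_class(fold):
    """Structural class of a SCOP fold number such as '2.26'."""
    return SCOP_CLASS[fold.split(".")[0]]


def cog_structure_function_class(cog, fold):
    """Place a COG-fold combination into its broader structure-function class."""
    return structure_class(fold), COG_LETTER[cog]


def mips_levels(code):
    """Split a MIPS catalogue number into (first two components, top component)."""
    parts = code.split(".")
    return ".".join(parts[:2]), parts[0]


def percent_of_total(count, total):
    """Express a number of combinations as a percentage of the diagram total."""
    return 100.0 * count / total


print(cog_structure_function_class("0186", "2.26"))
print(mips_levels("07.20.3"), mips_levels("06.2"))
print(round(percent_of_total(1, 141), 1))
```

With the total of 141 combinations quoted for sub-figure (a), a single combination corresponds to about 0.7%.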






[Figure 5 (table titled "Top Multifunctional Folds"): rows are three-component EC categories grouped as NONENZ, OX, TRAN, HYD, LY, ISO and LIG; columns are SCOP folds; cell values are not recoverable from the scan.]



Figure 5. The most versatile folds. The functions associated with the 16 most versatile folds are shown.
Values in the table denote the number of matches between a particular fold type in pdb95d (designated by
its fold number in SCOP 1.35) and an enzyme category (represented by the first three components of the
respective EC numbers). Here and in the following Tables the same parameters were used for matching as
in Figure 2. The numbers in the top row indicate the number of functions a particular fold is associated with.
The identifiers above the fold numbers are either PDB or SCOP identifiers of representative structures (the
latter only if the PDB entry contains more than one domain or chain). (See the legend to Table 3 for the
syntax of SCOP identifiers.) The first row in the table with the artificial 0.0.0 EC number shows the number of
matches with non-enzymatic functions. Among the two all-alpha folds in the table, cytochrome P450 (1.063) is
exclusively enzymatic, associated with five different enzyme functions, all related to cytochrome P450. Only
one alpha + beta fold, ferredoxin (4.031), is present in the table, predominantly with matches with
non-enzymatic ferredoxins, but also with enzymes in four different enzyme classes. In the multi-domain class,
beta-lactamase/D-ala carboxypeptidase (5.003) has the most matches with penicillinase (EC number 3.5.2) and
only one match with a non-enzyme, which also binds penicillin but has no enzymatic activity (Coque et al., 1993).
The class of small domains is represented only with one fold, membrane-bound rubredoxin-like (7.035), and has
matches only with enzymes. It is possible that some proteins classified as "non-enzymes" may indeed be
enzymes, missing the corresponding EC number. In this case, our analysis may be potentially useful in pointing
to which non-enzymes may actually be enzymes.



Figure 4(e) shows another functional classification scheme, the MIPS Yeast functional catalogue
(Mewes et al., 1997). Unlike the COGs scheme, this has the advantage of being applicable to every
yeast open reading frame (ORF). However, it has many more categories and about a third of the
yeast ORFs are classified into multiple categories (sometimes five or more), making interpretation of
the results a bit more ambiguous.

The most versatile folds and the most 
versatile functions 

Returning to considerations of all of Swissprot, Figure 5 lists the 16 most versatile folds. The top
five are the TIM-barrel, the alpha-beta hydrolase fold, the Rossmann fold, the P-loop containing
NTP hydrolase fold, and the ferredoxin fold. Four of these are alpha/beta folds and one is
alpha + beta. All five have non-enzymatic functions as well as five to 15 enzymatic ones. The most
versatile folds include four all-beta and two all-alpha folds.

Figure 6 lists the 18 functions that have the most different folds associated with them, each having
at least three associated folds. The most versatile functions are those of glycosidases and
carboxylases (3.2.1 and 4.2.1), which are associated with seven different fold types each, recruited
from at least three different fold classes. The next two most versatile functions, the phosphoric
monoester hydrolases and the linear monoester hydrolases (3.1.3 and 3.5.1), are associated with six
different fold types each. Most of the versatile functions are associated with folds in completely
different fold classes. This suggests that these enzymes developed independently, providing
many examples of convergent evolution. In contrast, only three functions, all oxidoreductases,
are associated with folds in a single class (last three rows in Figure 6). These folds are all
alpha/beta, namely the TIM-barrel, Rossmann, and flavodoxin folds.
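A ranking of this kind can be sketched as a simple grouping of fold-function pairs. In the example below the 3.2.1 pairs are the fold numbers listed for 3.2.1 enzymes in Table 3 (so they cover only part of the seven folds mentioned above), while the `x.y.z` entry and all function names are hypothetical. The `format_cell` helper mirrors the convention stated in the legend to Figure 6 that a hash marks values greater than 99.

```python
from collections import defaultdict

NON_ENZYME = "0.0.0"


def rank_functions(pairs):
    """Group folds by three-component EC category, most folds first."""
    folds_by_function = defaultdict(set)
    for fold, ec3 in pairs:
        if ec3 != NON_ENZYME:
            folds_by_function[ec3].add(fold)
    ranked = sorted(folds_by_function.items(),
                    key=lambda item: (-len(item[1]), item[0]))
    return [(ec3, sorted(folds)) for ec3, folds in ranked]


def fold_classes(folds):
    """Set of SCOP class numbers spanned by a collection of fold numbers."""
    return {fold.split(".")[0] for fold in folds}


def format_cell(n_matches):
    """Render a match count, using '#' for values greater than 99."""
    return "#" if n_matches > 99 else str(n_matches)


pairs = [
    ("1.061", "3.2.1"), ("3.001", "3.2.1"), ("2.018", "3.2.1"),
    ("4.002", "3.2.1"), ("3.002", "3.2.1"),
    ("3.001", "x.y.z"), ("3.002", "x.y.z"),   # hypothetical single-class function
    ("4.031", "0.0.0"),                       # non-enzymatic match, ignored
]
ranking = rank_functions(pairs)
for ec3, folds in ranking:
    print(ec3, len(folds), sorted(fold_classes(folds)))
```

A function whose folds span several class numbers is a candidate for convergence in the sense used in the text, whereas one confined to a single class is not.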

Specific functional convergences involving 
different folds 

Even on the level of specificity of four-component EC numbers, several enzymatic functions
are performed by unrelated structures. Figure 1 shows a dramatic example, two different carbonic
anhydrases with the same EC number 4.2.1.1, but with clearly different structures (Kisker et al.,
1996). Table 3 shows further examples in a more systematic fashion. Most of these occur in
different evolutionary lineages. For instance, the all-alpha vanadium chloroperoxidase occurs only in
fungi, while the alpha/beta non-heme chloroperoxidase occurs only in prokaryotes. Another
example is beta-glucanase. It has as many as three different structural representations, from
three different fold classes. While it has an all-beta structure in Bacillus subtilis, it has an
all-alpha variant in Bacillus circulans, and an alpha/beta structure in tobacco.

[Figure 6: table of three-component EC categories (rows) against SCOP folds (columns, grouped by fold class); cell values are not recoverable from the scan.]

Figure 6. The most versatile functions. Values in the table denote the number of matches between a particular
enzyme category (designated by the first three components of their EC numbers) and a SCOP 1.35 fold (designated
by their fold numbers). This Figure follows the same conventions described in the legend to Figure 5. The rows are
arranged in decreasing order according to the number of different folds with which they are associated (numbers
shown in the first column). A hash (#) in any cell indicates that its value is greater than 99.

Specific functional divergences on same fold 

Quite a number of SCOP domains each have sequence similarity with Swissprot proteins of
different function. We separated these into cases in which the structural domain has similarity to
proteins with different enzymatic functions only and those in which a domain shows homology to
both enzymes and non-enzymes (Table 4A and B, respectively). Table 4A includes the well-known
lactalbumin-lysozyme C similarity and the well-documented case of homology between an
eye-lens structural protein and an enzyme (crystallin and glutathione S-transferase; Cooper et al.,
1993; Qasba & Kumar, 1997). It includes several other notable divergences, such as the one
between lysophospholipidase and galectin, and the one between an elastase and an antimicrobial
protein (Morgan et al., 1991). Remarkably, of the seven domains in this Table, three belong to the
all-beta class.

"Multifunctionality" versus e-value 

Figure 7 shows how the number of "multifunctional" domains, i.e. domains with sequence
similarity to proteins with different functions, varies as a function of the stringency of the match
score threshold. We used a minimal version of SCOP in which the structures in PDB were clustered
into 990 representative domains (see the legend to Figure 7). The Figure shows how the percentage of
domains that have sequence similarity to proteins


Table 3. Specific convergences

EC#        Enzymatic function            Fold #1    Dom #1   Swissprot 1  Fold #2    Dom #2   Swissprot 2
1.11.1.10  Chloroperoxidase              3.048.001  d1broa_  PRXC_PSEPY   1.068.001  d1vnc__  PRXC_CURIN
1.15.1.1   Superoxide dismutase          2.001.007  d1srda_  SOD1_ORYSA   4.023.001  d1mnga2  SODM_BACCA
3.1.3.48   Protein-tyrosine phosphatase  3.028.001  d1phr__  PTPA_STRCO   3.029.001  d2hnp__  PYP3_SCHPO
3.1.26.4   Ribonuclease H                3.038.003  d2rn2__  RNH_ECOLI    3.039.001  d1tfr__  RNH_BPT4
3.2.1.4    Endoglucanase                 1.061.001  d1cem__  GUNLBACSP    3.001.001  d1ecea_  GUN_BACPO
3.2.1.8    Xylanase                      2.018.001  d1yna__  XYN_TRIHA    3.001.001  d2exo__  XYNB_THENE
3.2.1.14   Endochitinase                 3.001.001  d1hvq__  CHIA_TOBAC   4.002.001  d2baa__  CHIX_PEA
3.2.1.73   Beta-glucanase*               3.001.001  d1ghr__  GUB_NICPL    2.018.001  d1gbg__  GUB_BACSU
3.2.1.73   Beta-glucanase                1.061.001  d1cem__  GUB_BACCI
3.2.1.91   Exoglucanase                  2.018.001  d1cela_  GUX1_TRIVI   3.002.001  d1cb2a_  GUX3_AGABI
3.5.2.6    Beta-lactamase                5.003.001  d1btl__  BLP4_PSEAE   4.083.001  d1bmc__  BLAB_BACCE
4.2.1.1    Carbonic anhydrase            2.053.001  d1thja_  CAH_METTE    2.047.001  d2cba__  CAHZ_BRARE
5.2.1.8    Cis-trans isomerase           4.018.001  d1fkd__  MIP_TRYCR    2.041.001  d2cpl__  CYPR_DROME
5.4.99.5   Chorismate mutase             1.079.001  d1csma_  CHMU_YEAST   4.037.001  d2chsa_  CHMU_BACSU

Explicit enzymatic functions associated with different folds. Of the 13 different enzyme functions listed, eight are hydrolases, five
of which belong to the 3.2.1 EC category. One of them, beta-glucanase, is associated with three different folds. Note that most of the
enzymes in the Table are associated with folds from different classes. Even when the folds are from the same class, as in the case of
protein-tyrosine phosphatases, they are clearly different. Fold numbers are from SCOP 1.35. Domain identifiers are according to the
SCOP syntax: d1pdbcN, where "1pdb" is a PDB code, c is a chain identifier, and N describes if this is the first, second, or only
domain in the chain. Thus, d1ggta1 is the first domain in the A chain of 1GGT.
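
The identifier syntax described in the legend can be illustrated with a small parser. This is only a sketch: the function and type names are hypothetical, and it assumes that an underscore in the chain position means no chain identifier is given and that an underscore in the last position marks the only domain in the chain.

```python
import re
from typing import NamedTuple, Optional


class ScopDomainId(NamedTuple):
    pdb_code: str            # four-character PDB code, e.g. "1GGT"
    chain: Optional[str]     # None when the identifier has "_" in the chain slot
    domain: Optional[int]    # None when "_" marks the only domain in the chain


# d + PDB code (digit followed by three characters) + chain + domain number
_SID_PATTERN = re.compile(r"^d([0-9][a-z0-9]{3})([a-z0-9_])([0-9_])$")


def parse_scop_domain_id(sid: str) -> ScopDomainId:
    """Split an identifier such as 'd1ggta1' into PDB code, chain and domain."""
    match = _SID_PATTERN.match(sid.lower())
    if match is None:
        raise ValueError(f"not a seven-character SCOP domain identifier: {sid!r}")
    pdb_code, chain, domain = match.groups()
    return ScopDomainId(
        pdb_code.upper(),
        None if chain == "_" else chain.upper(),
        None if domain == "_" else int(domain),
    )
```

For the example in the legend, `parse_scop_domain_id("d1ggta1")` yields PDB code 1GGT, chain A, domain 1.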



[Table printed sideways in the original; its contents are not legible in this scan.]



Relationship between Protein Structure and Function 






[Figure 7 plot. Title: Relative number of domains with multiple functions, as a function of e-value threshold. x-axis: -log(e-value), from 0 to 70.]



Figure 7. Multi-functionality versus e-value threshold.
The graph shows how the percentage of multi-functional
enzymatic domains varies as a function of the e-value
threshold. A multi-functional domain occurs when a
particular domain in SCOP matches domains in Swissprot
with different enzymatic functions. For these
calculations, we had to use a more minimal version of
SCOP than the pdb95d dataset referred to in the
methods to prevent double matches, i.e. two SCOP
domains matching a single Swissprot domain. The
construction of this minimal SCOP was described
previously (Gerstein, 1998a). Basically, all the domains
in SCOP were clustered via a multi-linkage approach into
990 representative domains, such that no two domains
matched each other with a FastA e-value better than
0.01.



with different functions (in terms of three-component
EC numbers) varies with sequence similarity. This
percentage decreases approximately monotonically
as a function of the exponent of the e-value
threshold. Interestingly, there is a breaking point
around log(e-value) = -5, where the sharp decrease
in the number of functions slows down and the
matches reach the level of biological significance.

Our graph can be loosely compared with the
classic graph by Chothia & Lesk (1986) showing
the relation of similarity in structure to that in
sequence. It roughly shows the chance of functional
similarity (or more precisely the chance of
functional difference) with a given level of
sequence similarity between an enzyme and a protein
of unknown function. For example, with an
e-value of 10^-10, there is only about a 5% chance that
an unknown protein homologous to a certain
enzyme has in fact a different function. Moreover,
our graph is in excellent agreement with the findings
by Russell et al. (1998), who also found that
the proportion of homologues with different functions
is around 10%. This shows that there is a
low chance that a single-domain protein, highly
homologous to a known enzyme, has a different
function.
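
The kind of calculation behind Figure 7 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the `Hit` record and function names are hypothetical, functions are compared on the first three EC components, and the denominator is taken to be the set of domains with at least one match at the given threshold (the text does not specify the normalization).

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Hit(NamedTuple):
    domain: str     # SCOP domain identifier (query)
    ec: str         # EC number of the matched Swissprot enzyme
    evalue: float   # e-value of the match


def three_component_ec(ec: str) -> str:
    """Reduce an EC number such as '3.2.1.73' to its first three components."""
    return ".".join(ec.split(".")[:3])


def multifunctional_fraction(hits: Iterable[Hit], threshold: float) -> float:
    """Fraction of matched domains whose matches span more than one function."""
    functions: Dict[str, Set[str]] = defaultdict(set)
    for hit in hits:
        if hit.evalue <= threshold:  # keep only matches at least this significant
            functions[hit.domain].add(three_component_ec(hit.ec))
    if not functions:
        return 0.0
    multi = sum(1 for ecs in functions.values() if len(ecs) > 1)
    return multi / len(functions)


# Toy data (invented for illustration only).
toy_hits = [
    Hit("dA", "3.2.1.4", 1e-30),
    Hit("dA", "4.2.1.1", 1e-3),
    Hit("dB", "1.1.1.1", 1e-12),
    Hit("dB", "1.1.1.2", 1e-8),
]
```

Evaluating `multifunctional_fraction` over a series of thresholds traces out a curve of the same kind as the one in the Figure; in the toy data, the weak second match of domain dA drops out once the threshold is tightened.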

Discussion and Conclusions 

Overview 

We have investigated the relationship between
the structure and function of proteins by comparing
functionally characterized enzymes in Swissprot
with structurally characterized domains in
SCOP. It is a timely subject, as the number of
three-dimensional protein structures is increasing
rapidly and the recent completion of several
microbial genomes highlights the need for functional
characterization of the gene products and
identification of enzymes participating in metabolic
pathways (Koonin et al., 1998).

We tried to be as objective and as unbiased as
possible, taking only enzymes with a single
assigned function and only single-domain matches.
We ignored Swissprot proteins with dubious or
unknown function, or with incomplete sequence.
Given these criteria, several tendencies are clear.
The alpha/beta folds tend to be enzymes. The
all-alpha folds tend to be non-enzymes, and the
all-beta and alpha + beta folds tend to have a more
even distribution between enzymes and
non-enzymes.

Our analysis of proteins from yeast and E. coli
has shown that the functional distribution of
folds does not differ greatly from the whole of
Swissprot. E. coli, however, appears to have
somewhat more alpha/beta enzymes and fewer
non-enzymes.

Functional assignment complexities 

We identified four specific complexities in our 
functional assignment worth mentioning. 

Firstly, there is not always a one-to-one relationship
between gene, protein and reaction (Riley,
1998). An enzyme can have two functions, or two
polypeptides from two different genes can oligomerize
to perform a single function. It might be
that some of the fold-function combinations in
Figure 2 occur together in multi-domain proteins
(which otherwise were not the subject of this survey).
An exhaustive screening revealed that only
four pairs of folds in Figure 2 were present concurrently
in multi-domain proteins. Each of these
reduced by one the number of independent
fold-function combinations. (The four pairs were as
follows, with one representative Swissprot protein in
each category, EC numbers in parentheses, and
then SCOP fold numbers: PTAA_ECOLI (2.7.1) has
4.049 and 2.055 folds, TRP_COPCI (4.2.1) has 3.057
and 4.005 folds, URE1_HELFE (3.5.1) has 4.005 and
2.056 folds, while XYNA_RUMFL (3.2.1) has 2.018
and 3.001 folds.)

Secondly, the functions associated with similar
structures often turn out to be analogous, even if
their EC numbers differ significantly. For example,
acetyl-CoA carboxylase and methylmalonyl-CoA
carboxyltransferase enzymes are both actually part
of enzyme complexes in which they perform the same
function, acting as enzyme carriers. This similarity
is not reflected in their EC classification numbers
(6.4.1.2 and 2.1.3.1, respectively).

Thirdly, there are clearly some drawbacks to the
EC system. The EC system is a classification of
reactions, not underlying biochemical mechanisms.
An enzyme classification system based explicitly
on reaction mechanism (e.g. "involves pyridoxal
phosphate" or "involves Ser as a nucleophile")
might also prove interesting to compare with protein
structure. Alternatively, one based on pathways
might be worthwhile since, as pointed out by
Martin et al. (1998), "it may be that more significant
relationships occur within pathways, where
the substrate is successively transferred from
enzyme to enzyme along the pathway, requiring
similar binding sites at each stage".

Finally, in all of Swissprot the majority of the 
101 folds with only non-enzymatic functions prob- 
ably have several functions, but we were not able 
to consider them separately here, lacking a general 
protein function classification system for non- 
enzymes. Such a system is not easy to derive. For 
instance, if we took only the first three words of all 
the description lines in Swissprot, we would end 
up with about 10,000 different protein functions 
(besides enzymes). An approximate solution to this 
problem is offered by a recent work that has
classified 81% of Swissprot into one of three broad
categories in an automated fashion (Tamames et al.,
1997). However, one way we did tackle this problem
was by focussing on the yeast genome for
which there are a number of overall functional 
classification systems. This work showed that the 
preferred association of folds with certain functions 
occurs for non-enzymes as well as enzymes. Fur- 
thermore, the results for the highly conserved 
COGs would be expected to be exactly the same in 
other genomes. 

Biases 

Our results are undoubtedly affected to some 
degree by the biases inherent in the databanks, e.g. 
towards mammalian, medically relevant proteins 
and towards proteins that easily crystallize. Such 
biases probably result in the higher representation 
of enzymes in the structural databases, in the PDB 
and therefore in SCOP. This might be the cause of 
the higher occurrence of alpha/beta proteins in our
tables and the higher density of matches in this 
class. 

One interesting question related to biases is 
whether looking only at individual genomes 
instead of the whole database will give different 
results. Our results for yeast suggest that it is not 
necessarily the case. 

Comparison with Martin et al. (1998) 

Martin et al. (1998) performed a similar analysis 
to the one described here. One of the conclusions 
of their careful study was that there was no 
relationship between the top-level CATH classifi- 
cation and the top-level EC class. This seems to be 
at odds with our results. However, we have found 
the conclusions to be consistent. There are a num- 
ber of reasons for this. 



Firstly, Martin et al. (1998) tabulate statistics on
only the proteins in the PDB. They found a clear
alpha/beta preference for proteins in the
oxidoreductase, transferase, and hydrolase categories
(EC 1-3), but for the lyase, isomerase, and ligase
categories (EC 4-6) they observe different tendencies.
However, they did not have sufficient
counts to establish statistical significance for this
latter finding. (This is basically what we observe in
Figure 4(b).) Because in our analysis we use all of
Swissprot and we tabulate our statistics a little
differently (in terms of combinations), we get more
"counts" than Martin et al. (1998). Thus, we are
able to argue that the different distribution of
fold-function combinations observed for lyases,
isomerases, and ligases is significant. This is borne
out by the chi-squared statistics at the end of
Table 2.

Secondly, the "no-relationship" conclusion of
Martin et al. applies only to comparisons between the
different enzyme classes. However, we find our
largest differences when comparing non-enzymes
to enzymes and also when comparing between the
various types of non-enzymes.

Finally, the CATH classification that Martin et al.
use has only three classes in its top-most level. In
contrast, SCOP has six top classes (Table 1). While
this larger number of categories does tend to
degrade our statistics somewhat, it also highlights
some differences that cannot be observed in terms
of the CATH classes alone, e.g. we find clear
differences between alpha + beta and alpha/beta
proteins and also between small proteins and all
others.

Apparently high occurrence of 
convergent evolution 

Note that the table in Figure 2 is not square: it 
has more folds than functions. This shape leads to 
a number of interesting conclusions. The 331 fold- 
function combinations we observe for 229 folds 
and 92 functions imply that there are 1.2 functions 
per fold and 3.6 folds per function. However, these 
numbers are somewhat skewed by the large num- 
ber of folds (101) associated only with the single 
non-enzymatic function. If we exclude these, we 
get 128 "enzyme-related" folds, which are, in turn, 
associated with 230 (= 331 - 101) different fold- 
function combinations. This implies that for the 
enzyme-related folds there are on average 1.8 func- 
tions per fold and 2.5 folds per function (230/128 
and 230/92). The larger number of folds per func- 
tion than functions per fold seems to suggest that 
nature tends to reinvent an enzymatic function (i.e. 
convergent evolution) more often than modify an 
already existing one (i.e. functional divergence). 
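
The arithmetic for the enzyme-related folds can be reproduced directly from the counts quoted above. The sketch below uses hypothetical variable names and only the numbers given in the text; the overall functions-per-fold figure is left out because the text does not say how it was normalized.

```python
def fold_function_ratios(combinations: int = 331,
                         folds: int = 229,
                         functions: int = 92,
                         non_enzyme_only_folds: int = 101) -> dict:
    """Average folds per function and functions per fold from the quoted counts."""
    # Folds that carry at least one enzymatic function.
    enzyme_folds = folds - non_enzyme_only_folds
    # Each non-enzyme-only fold contributes exactly one combination.
    enzyme_combinations = combinations - non_enzyme_only_folds
    return {
        "enzyme_folds": enzyme_folds,
        "enzyme_combinations": enzyme_combinations,
        "folds_per_function_all": round(combinations / functions, 1),
        "functions_per_fold_enzyme": round(enzyme_combinations / enzyme_folds, 1),
        "folds_per_function_enzyme": round(enzyme_combinations / functions, 1),
    }
```

With the default counts this gives 128 enzyme-related folds, 230 combinations, 1.8 functions per fold and 2.5 folds per function, matching the values in the text.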

How can we explain this? Firstly, 1.8 is a lower
estimate for the number of functions per fold, as
the non-enzymatic functions were bundled into
one group here. Secondly, there are several
examples of functional divergence for a fold within
one three-component enzyme category that are not
reflected in our Tables. For instance, the 1.1.1
category has 248 different enzymes, which all share
the same fold. Thirdly, the results in this paper
were derived from databases comprising data
from several organisms. It is quite possible that
within one organism, functional divergence is
more prevalent than convergent evolution.

Superfolds and superfunctions 

Are functions more diverse for the more com- 
mon folds? To some degree this brings up a 
"chicken-and-egg" issue. Do folds have more func- 
tions because they occur more often or is it the 
other way around? The commonness of a fold is 
often quantified by the number of non-homologous 
sequence families accommodated by the fold, and 
folds accommodating many families of diverse 
sequences have been dubbed "superfolds" 
(Orengo et al., 1993). We find that there seems to
be a loose connection between the number of 
diverse sequence families associated with a par- 
ticular fold (in SCOP) and the functional diversity 
of that fold. For instance, the top superfold is the 
TIM-barrel; it also has the most functions associ- 
ated with it (15 different enzymatic functions as 
shown in Figure 4). On the other hand, there are 
exceptions: the alpha/beta hydrolases and the 
Rossmann fold are both associated with 22 
sequence families in SCOP, but while the former 
has eight different enzymatic functions, the latter 
has only three. 

Finally, while there is a high incidence of par- 
ticular functions with many folds ("superfunc- 
tions"), as well as folds with many functions, the 
distribution of superfunctions appears to be more 
uniform and less concentrated on a few exception- 
ally versatile individuals than is the case for folds. 
That is, comparing Figures 3 and 4 one can see 
that the top nine most versatile functions are 
associated with five to seven folds while the top 
nine most versatile folds carry out from six to as 
many as 16 functions. This last value is for the 
TIM-barrel and underscores the uniqueness of this 
fold as a generic scaffold (see Figure 1 for an illus- 
tration of this fold). 

Why folds are associated with functions: 
chemistry versus history 

Why is a certain fold chosen to carry out a particular
function? It is, of course, not possible to
answer this question definitively at present. However,
there are two broad themes that emerge from
our analysis. The first is favorable chemistry. Perhaps
the TIM-barrel design simply provides a
"more efficient" scaffold for enzyme reactions, and
that is why it is so prevalent. Another factor is history.
Perhaps the association between a particular
fold and its function reflects a particular "accident"
that took place at the beginning of cellular evolution.
However, once this choice was made it was
impossible to undo even if other folds would be
more chemically suitable. This could be the situation
for the ribosomal proteins (and is borne out
by the results of Figure 4(d)).

Materials and Methods 

Sequence matching to swissprot 

All the protein sequences in Swissprot 35 were compared
with all the protein domain sequences in SCOP
1.35 by standard database search programs (WU-BLAST;
Altschul et al., 1990). The following five criteria were
used in the searches: (1) At least three of the four
components of the EC number are assigned in the DE line of
the Swissprot entries. (2) Fragments in Swissprot were
excluded (this affected about 10% of the entries). (3) For
WU-BLAST searches an e-value threshold of 0.0001 was
used, unless stated otherwise. (4) Only "monoenzymes",
i.e. proteins with only one enzymatic function, were
considered. This excluded less than 0.5% of the Swissprot
enzymes. (5) Only single-domain matches with Swissprot
proteins were taken into consideration. This means
those proteins that had a match with a SCOP domain
covering most of the Swissprot protein. Specifically, we
required that fewer than 100 amino acid residues be left
uncovered in the Swissprot entry by a match. We are
aware that this is only an approximation, as there are
domains with fewer than 100 amino acid residues; however,
it is considerably less than the average length of a
SCOP domain (163 residues) and seems to be a reasonable
threshold in an automated approach.
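
The five criteria above can be sketched as a single filter. This is a minimal illustration, not the authors' pipeline: the `Match` record and its field names are hypothetical, the e-value cutoff is assumed to be inclusive, and "uncovered residues" is assumed to mean the protein length minus the length of the matched region.

```python
from dataclasses import dataclass
from typing import List

EVALUE_MAX = 1e-4      # criterion (3): WU-BLAST e-value threshold
UNCOVERED_MAX = 100    # criterion (5): residues left uncovered must stay below this


@dataclass
class Match:
    ec_numbers: List[str]   # EC numbers from the entry's DE line, e.g. ["3.2.1.4"]
    is_fragment: bool       # entry is annotated as a fragment
    evalue: float           # e-value of the match to a SCOP domain
    protein_length: int     # length of the Swissprot sequence
    match_start: int        # first matched residue (1-based, inclusive)
    match_end: int          # last matched residue (inclusive)


def assigned_components(ec: str) -> int:
    """Count the leading EC components that are assigned (numeric)."""
    count = 0
    for part in ec.split("."):
        if not part.isdigit():
            break
        count += 1
    return count


def passes_criteria(m: Match) -> bool:
    if len(m.ec_numbers) != 1:                      # (4) monoenzymes only
        return False
    if assigned_components(m.ec_numbers[0]) < 3:    # (1) at least three EC components
        return False
    if m.is_fragment:                               # (2) no fragments
        return False
    if m.evalue > EVALUE_MAX:                       # (3) e-value threshold
        return False
    covered = m.match_end - m.match_start + 1
    return m.protein_length - covered < UNCOVERED_MAX   # (5) single-domain match
```

An entry annotated as `3.2.1.-` would pass criterion (1), whereas `3.2.-.-` would not.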

All the searches were repeated using FASTA with an 
e-value threshold of 0.01 (Pearson, 1998; Pearson & 
Lipman, 1988). The results obtained by the two different 
comparison programs were in agreement with each 
other. That is, the FASTA searches did not result in any 
new combinations of folds and enzymatic functions (a 
new dot in Figure 1), and therefore are not shown. 

Sequence matching to the yeast genome 

To get as great a coverage of the yeast genome as
possible, we did a sequence comparison for just Figure 4
using an altered protocol. We first ran the PDB against
the yeast genome using FASTA and kept all matches
with a better than 0.01 e-value (Pearson, 1998; Pearson &
Lipman, 1988). Then, to increase our number of matches
further we used the PSI-blast program (Altschul et al.,
1997). This program is somewhat more complex to run
than FASTA, involving embedding the yeast genome
in NRDB and running PDB query sequences against it
in an iterative fashion, adding the matches found at
each round to a growing profile. We used the PSI-blast
parameters adapted from Teichmann et al. (1998): an
e-value threshold of 0.0005 to include matches in the
profile and iteration of up to 30 times or to convergence.
We did not continuously parse the output and
accepted matches at the final iteration that had e-value
scores better than 0.0001. The number of iterations to
convergence varies depending on the PDB domains
being run. Runs that take many iterations, such as
those for the immunoglobulin superfamily, take quite a
long time (up to 30 minutes on a DEC 500 MHz workstation)
and create large output files. In total, PSI-blast
finds many more matches than either FASTA or
WU-BLAST. However, it has problems with certain small
and compositionally biased proteins. We used FASTA
for these and also tried to remove compositional bias
through running the SEG program with standard
parameters (Wootton & Federhen, 1996).
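
For readers who want to try comparable settings today, the sketch below assembles a command line for the `psiblast` program from the current NCBI BLAST+ suite. It is an assumption-laden modern analogue, not the command the authors ran: the function name, file names and database name are hypothetical, and only the three numeric parameters come from the text.

```python
from typing import List

INCLUSION_ETHRESH = 0.0005   # e-value for including a match in the profile
MAX_ITERATIONS = 30          # iterate up to 30 times or until convergence
REPORT_EVALUE = 0.0001       # accept final-iteration matches better than this


def build_psiblast_command(query_fasta: str, database: str,
                           output_path: str = "psiblast.out") -> List[str]:
    """Return an argument list for BLAST+ psiblast (not executed here)."""
    return [
        "psiblast",
        "-query", query_fasta,                         # PDB domain sequence(s)
        "-db", database,                               # e.g. NRDB with the yeast genome embedded
        "-num_iterations", str(MAX_ITERATIONS),
        "-inclusion_ethresh", str(INCLUSION_ETHRESH),
        "-evalue", str(REPORT_EVALUE),
        "-out", output_path,
    ]
```

The list can be passed to `subprocess.run` on a machine where BLAST+ is installed; `psiblast` stops early if the search converges before the iteration limit.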



How the structural classifications were used: SCOP 
and CATH 

SCOP hierarchically clusters all the domains in the
PDB database, assigning a five-component number to
each domain (Murzin et al., 1995). The first component in
the SCOP numbers denotes the structural class to which
the domain in question belongs. The second component
of the SCOP numbers designates the fold type of the
domain. There are altogether 361 different fold types in
SCOP 1.35. The six SCOP classes used in this survey are
listed in Table 1B.

In this study, a 95% non-redundant subset of SCOP 
was used, i.e. all pairs of domains had less than 95% 
sequence homology. This set is denoted pdb95d and is
available from the SCOP website (scop.mrc-lmb.cam.ac.uk).
We used version 1.35, which had 2314 protein
domains. (The yeast analysis used a more recent version 
of SCOP, 1.38, which had 3206 domains.) 
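
Because the class and the fold are carried by the first two components of a SCOP number, two domains can be compared at the fold level by looking only at that prefix. The helper names below are hypothetical; the sketch accepts any dotted number with at least two numeric components, such as the three-component fold numbers listed in Table 3.

```python
from typing import NamedTuple


class ScopFold(NamedTuple):
    scop_class: int   # first component: structural class
    fold: int         # second component: fold type within the class


def parse_scop_fold(number: str) -> ScopFold:
    """Extract class and fold from a dotted SCOP number such as '3.001.001'."""
    parts = number.split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected SCOP number: {number!r}")
    return ScopFold(int(parts[0]), int(parts[1]))


def same_fold(a: str, b: str) -> bool:
    """True when two SCOP numbers share both class and fold."""
    return parse_scop_fold(a) == parse_scop_fold(b)
```

For instance, the two xylanase entries in Table 3 (2.018.001 and 3.001.001) differ already in the first component, so they belong to different classes as well as different folds.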

The CATH classification classifies structures in analogous
fashion to SCOP (Orengo et al., 1997). However, the
exact structure of the classification is not the same, 
with an additional architecture level inserted between 
the top-level class and the fold-level. In our use of 
the classification, we created a limited mapping table 
that associated each SCOP domain in pdb95d with 
its corresponding classification in CATH 1.4. This 
was not always possible to do unambiguously. As a 
result, we left out the ambiguous matches from the 
statistics. 



How the functional classifications were used: 
ENZYME, COGS, and MIPS 

The EC numbers of enzymes are composed of four 
components (Barrett, 1997). (1) The first component 
shows to which of the six main divisions the enzyme 
belongs. (2) The second figure indicates the subclass 
(referring to the donor in oxidoreductases or the group 
transferred in transferases, or the affected bond in hydro- 
lases, lyases or ligases). (3) The third figure indicates the 
sub-subclass (e.g. indicating the type of acceptor in 
oxidoreductases). (4) The fourth figure gives the serial 
number of the enzyme in its sub-subclass. The six main 
divisions are listed in Table 1A. 

In the analysis of all of Swissprot, when we counted 
the number of non-enzymatic matches, all the proteins 
called 'HYPOTHETICAL' and all the proteins having an 
'-ase' word ending but lacking an EC number in their 
description were excluded, because of their functional 
ambiguity. For relating the sequence matches of the 
yeast genome to the EC system, we used essentially the 
same criteria as we did for all of Swissprot (see above): 
single-domain, monoenzyme matches with at least a 
three-component EC number. 

The COGs and especially the MIPS classifications are
a bit more complex than the EC system in that they
include non-enzymes as well as enzymes (Tatusov et al.,
1997; Koonin et al., 1998; Mewes et al., 1997). They often
associate multiple functions or roles to a given yeast
ORF. This happens for more than a third of the yeast
ORFs with MIPS. In this case, if we could clearly show a
PDB match was associated with a single functional
domain we made only that pairing. Otherwise we
associated all the functions assigned to a given PDB
match to its respective fold.



Availability of results over the internet 

A number of detailed tables relevant to our study
will be made available over the Internet at
http://bioinfo.mbb.yale.edu/genome/foldfunc, in particular a
"clickable" version of Figure 1 and large data files giving
all the fold assignments and fold-function combinations
for Swissprot and yeast.



Acknowledgments 

We thank the Donaghue Foundation and the ONR for 
financial support (grant N000149710725). We thank Ted 
Johnson for help with the minimal version of the SCOP 
database. 



References 

Altschul, S., Gish, W., Miller, W., Myers, E. W. & 
Lipman, D. J. (1990). Basic local 
alignment search tool. J. Mol. Biol. 215, 403-410. 

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., 
Zhang, Z., Miller, W. & Lipman, D. J. (1997). 
Gapped BLAST and PSI-BLAST: a new generation 
of protein database search programs. Nucl. Acids 
Res. 25, 3389-3402. 

Attwood, T. K., Beck, M. E., Flower, D. R., Scordis, P. &
Selley, J. N. (1998). The PRINTS protein fingerprint
database in its fifth year. Nucl. Acids Res. 26, 304-308.

Bairoch, A. (1996). The ENZYME data bank in 1995.
Nucl. Acids Res. 24, 221-222.

Bairoch, A. & Apweiler, R. (1998). The SWISS-PROT 
protein sequence data bank and its supplement 
TrEMBL in 1998. Nucl. Acids Res. 26, 38-42. 

Bairoch, A., Bucher, P. & Hofmann, K. (1997). The PRO- 
SITE database, its status in 1997. Nucl. Acids Res. 
25, 217-221. 

Barrett, A. J. (1997). Nomenclature Committee of the
International Union of Biochemistry and Molecular
Biology (NC-IUBMB). Enzyme nomenclature.
Recommendations 1992. Supplement 4: corrections and
additions (1997). Eur. J. Biochem. 250, 1-6.

Bork, P. & Eisenberg, D. (1998). Deriving biological
knowledge from genomic sequences. Curr. Opin.
Struct. Biol. 8, 331-332.

Bork, P. & Koonin, E. V. (1998). Predicting functions
from protein sequences - where are the bottlenecks?
Nature Genet. 18, 313-318.

Bork, P., Sander, C. & Valencia, A. (1993). Convergent
evolution of similar enzymatic function on different
protein folds: the hexokinase, ribokinase, and
galactokinase families of sugar kinases. Protein Sci. 2,
31-40.

Bork, P., Ouzounis, C. & Sander, C. (1994). From genome
sequences to protein function. Curr. Opin.
Struct. Biol. 4, 393-403.

Chen, L., DeVries, A. L. & Cheng, C. H. (1997). Conver- 
gent evolution of antifreeze glycoproteins in Antarc- 
tic notothenioid fish and Arctic cod. Proc. Natl Acad. 
Sci. USA, 94, 3817-3822. 



Chothia, C. & Lesk, A. M. (1986). The relation between 
the divergence of sequence and structure in pro- 
teins. EMBO J. 5, 823-826. 

Cooper, D. L., Isola, N. R., Stevenson, K. & Baptist,
E. W. (1993). Members of the ALDH gene family
are lens and corneal crystallins. Advan. Exp. Med.
Biol. 328, 169-179.

Coque, J. J., Liras, P. & Martin, J. F. (1993). Genes for a
beta-lactamase, a penicillin-binding protein and a
transmembrane protein are clustered with the
cephamycin biosynthetic genes in Nocardia
lactamdurans. EMBO J. 12, 631-639.

Corpet, F., Gouzy, J. & Kahn, D. (1998). The ProDom
database of protein domain families. Nucl. Acids
Res. 26, 323-326.

des Jardins, M., Karp, P. D., Krummenacker, M., Lee,
T. J. & Ouzounis, C. A. (1997). Prediction of enzyme
classification from protein sequence without the use
of sequence similarity. ISMB, 5, 92-99.

Doolittle, R. F. (1994). Convergent evolution: the need to 
be explicit. Trends Biochem. Sci. 19, 15-18. 

Fabian, P., Murvai, J., Hatsagi, Z., Vlahovicek, K., 
Hegyi, H. & Pongor, S. (1997). The SBASE protein 
domain library, release 5.0: a collection of annotated 
protein sequence segments. Nucl. Acids Res. 25, 240- 
243. 

Frishman, D. & Mewes, H.-W. (1997). Protein structural 
classes in five complete genomes. Nature Struct. 
Biol. 4, 626-628. 

Galperin, M. Y., Walker, D. R. & Koonin, E. V. (1998). 
Analogous enzymes: independent inventions in 
enzyme evolution. Genome Res. 8, 779-790. 

Gerstein, M. (1997). A structural census of genomes:
comparing eukaryotic, bacterial and archaeal genomes
in terms of protein structure. J. Mol. Biol. 274,
562-576.

Gerstein, M. (1998a). How representative are the known
structures of the proteins in a complete genome?
A comprehensive structural census. Fold. Design, 3,
497-512.

Gerstein, M. (1998b). Patterns of protein-fold usage in
eight microbial genomes: a comprehensive structural
census. Proteins: Struct. Funct. Genet. 33, 518-534.

Gerstein, M. & Hegyi, H. (1998). Comparing microbial 
genomes in terms of protein structure: surveys of 
a finite parts list. FEMS Microbiol. Rev. 22, 277- 
304. 

Gerstein, M. & Levitt, M. (1997). A structural census of 
the current population of protein sequences. Proc. 
Natl Acad. Sci. USA, 94, 11911-11916. 

Hellinga, H. W. (1997). Rational protein design: combin- 
ing theory and experiment. Proc. Natl Acad. Sci. 
USA, 94, 10015-10017. 

Hellinga, H. W. (1998). Computational protein engineering.
Nature Struct. Biol. 5, 525-527.

Henikoff, S., Pietrokovski, S. & Henikoff, J. G. (1998). 
Superior performance in protein homology detec- 
tion with the Blocks Database servers. Nucl. Acids 
Res. 26, 309-312. 

Hodges, P. E., Payne, W. E. & Garrels, J. I. (1998). The 
Yeast Protein Database (YPD): a curated proteome 
database for Saccharomyces cerevisiae. Nucl. Acids 
Res. 26, 68-72. 

Holm, L. & Sander, C. (1998). Touring protein fold 
space with Dali/FSSP. Nucl. Acids Res. 26, 316-319. 

Ibba, M., Bono, J. L., Rosa, P. A. & Söll, D. (1997a).
Archaeal-type lysyl-tRNA synthetase in the Lyme
disease spirochete Borrelia burgdorferi. Proc. Natl
Acad. Sci. USA, 94, 14383-14388.

Ibba, M., Morgan, S., Curnow, A. W., Pridmore, D. R.,
Vothknecht, U. C., Gardner, W., Lin, W., Woese,
C. R. & Söll, D. (1997b). A euryarchaeal lysyl-tRNA
synthetase: resemblance to class I synthetases.
Science, 278, 1119-1122.

Karp, P. (1998). What we do not know about sequence
analysis and sequence databases. Bioinformatics, 14,
753-754.

Karp, P. D., Riley, M., Paley, S. M., Pellegrini-Toole, A. 
& Krummenacker, M. (1998). EcoCyc: encyclopedia 
of Escherichia coli genes and metabolism. Nucl. Acids 
Res. 26, 50-53. 

Kisker, C, Schindelin, H., Alber, B. E., Ferry, J. G. & 
Rees, D. C (1996). A left-hand beta-helix revealed 
by the crystal structure of a carbonic anhydrase 
from the archaeon Methanosarcina thermophila. 
EMBO }. 15, 2323-2330. 

Koonin, E. V. & Galperin, M. Y. (1997). Prokaryotic gen- 
omes: the emerging paradigm of genome-based 
microbiology. Cwrr. Opin. Genet. Dev. 7, 757-763. 

Koonin, E. V. & Tatusov, R. L. (1994). Computer
analysis of bacterial haloacid dehalogenases defines a
large superfamily of hydrolases with diverse
specificity. Application of an iterative approach to
database search. J. Mol. Biol. 244, 125-132.

Koonin, E. V., Tatusov, R. L. & Galperin, M. Y. (1998). 
Beyond complete genomes: from sequence to struc- 
ture and function. Curr. Opin. Struct. Biol. 8, 355- 
363. 

Kraulis, P. J. (1991). MOLSCRIPT: a program to produce
both detailed and schematic plots of protein
structures. J. Appl. Crystallog. 24, 946-950.

Martin, A. C., Orengo, C. A., Hutchinson, E. G., Jones,
S., Karmirantzou, M., Laskowski, R. A., Mitchell, 
J. B., Taroni, C. & Thornton, J. M. (1998). Protein 
folds and functions. Structure, 6, 875-884. 

Marvin, J. S., Corcoran, E. E., Hattangadi, N. A., Zhang,
J. V., Gere, S. A. & Hellinga, H. W. (1997). The
rational design of allosteric interactions in a
monomeric protein and its applications to the
construction of biosensors. Proc. Natl Acad. Sci. USA, 94,
4366-4371.

Mewes, H. W., Albermann, K., Bahr, M., Frishman, D.,
Gleissner, A., Hani, J., Heumann, K., Kleine, K., 
Maierl, A., Oliver, S. G., Pfeiffer, F. & Zollner, A. 
(1997). Overview of the yeast genome. Nature, 387, 
7-65. 

Morgan, J. G., Sukiennicki, T., Pereira, H. A., Spitznagel, 
J. K., Guerra, M. E. & Larrick, J. W. (1991). Cloning 
of the cDNA for the serine protease homolog 
CAP37/azurocidin, a microbicidal and chemotactic 
protein from human granulocytes. J. Immunol. 147,
3210-3214. 

Murzin, A., Brenner, S. E., Hubbard, T. & Chothia, C. 
(1995). SCOP: a structural classification of proteins 
for the investigation of sequences and structures. 
J. Mol. Biol. 247, 536-540.

Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H. &
Kanehisa, M. (1999). KEGG: Kyoto encyclopedia of
genes and genomes. Nucl. Acids Res. 27, 29-34.

Orengo, C. A., Flores, T. P., Taylor, W. R. & Thornton, 
J. M. (1993). Identifying and classifying protein fold 
families. Protein Eng. 6, 485-500. 

Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T.,
Swindells, M. B. & Thornton, J. M. (1997). CATH - a
hierarchic classification of protein domain
structures. Structure, 5, 1093-1108.









Pearson, W. R. (1996). Effective protein sequence
comparison. Methods Enzymol. 266, 227-259.

Pearson, W. R. (1998). Empirical statistical estimates for 
sequence similarity searches. J. Mol. Biol. 276, 71-84.

Pearson, W. R. & Lipman, D. J. (1988). Improved tools 
for biological sequence analysis. Proc. Natl Acad. Sci. 
USA, 85, 2444-2448. 

Qasba, P. K. & Kumar, S. (1997). Molecular divergence
of lysozymes and alpha-lactalbumin. Crit. Rev.
Biochem. Mol. Biol. 32, 255-306.

Riley, M. (1997). Genes and proteins of Escherichia coli 
K-12 (GenProtEC). Nucl. Acids Res. 25, 51-52.

Russell, R. B. (1998). Detection of protein
three-dimensional side-chain patterns: new examples of
convergent evolution. J. Mol. Biol. 279, 1211-1227.

Russell, R. B., Sasieni, P. D. & Sternberg, M. J. E. (1998). 
Supersites within superfolds. Binding site similarity 
in the absence of homology. J. Mol. Biol. 282, 903-918.

Seery, L. T., Nestor, P. V. & FitzGerald, G. A. (1998).
Molecular evolution of the aldo-keto reductase gene
superfamily. J. Mol. Evol. 46, 139-146.

Selkov, E., Galimova, M., Goryanin, I., Gretchkin, Y.,
Ivanova, N., Komarov, Y., Maltsev, N., Mikhailova,
N., Nenashev, V., Overbeek, R., Panyushkina, E.,
Pronevitch, L. & Selkov, E., Jr (1997). The metabolic
pathway collection: an update. Nucl. Acids Res. 25,
37-38.

Sonnhammer, E., Eddy, S. & Durbin, R. (1997). Pfam: a 
comprehensive database of protein domain families 
based on seed alignments. Proteins: Struct. Funct. 
Genet. 28, 405-420.

Tamames, J., Casari, G., Ouzounis, C. & Valencia, A. 
(1997). Conserved clusters of functionally related 
genes in two bacterial genomes. J. Mol. Evol. 44, 66-73.

Tatusov, R. L., Koonin, E. V. & Lipman, D. J. (1997). A 
genomic perspective on protein families. Science, 
278, 631-637. 

Teichmann, S., Park, J. & Chothia, C. (1998). Struc- 
tural assignments to the proteins of Mycoplasma 
genitalium show that they have been formed by 
extensive gene duplications and domain 
rearrangements. Proc. Natl Acad. Sci. USA, 95, 
14658-14663. 

Wootton, J. C. & Federhen, S. (1996). Analysis of compo- 
sitionally biased regions in sequence databases. 
Methods Enzymol. 266, 554-571. 



Edited by G. von Heijne 



(Received 16 November 1998; received in revised form 1 March 1999; accepted 1 March 1999) 



